Abstract

The purpose of this paper is to explain the interest and importance of (approximate)
models and model selection in Statistics. Starting from the very elementary example of
histograms we present a general notion of finite dimensional model for statistical
estimation and we explain what type of risk bounds can be expected from the use of one
such model. We then give the performance of suitable model selection procedures from a
family of such models. We illustrate our point of view by two main examples: the choice
of a partition for designing a histogram from an n-sample and the problem of variable
selection in the context of Gaussian regression.

1 Introduction: a story of histograms 
1.1 Histograms as graphical tools 

Assume we are given a (large) set of real-valued measurements or data $x_1, \ldots, x_n$,
corresponding to lifetimes of some human beings in a specific area, or lifetimes of
some manufactured goods, or to the annual income of families in some country, ....
Such measurements have a bounded range $[a, b]$ which is often known in advance
(for instance $[0, 120]$ would do for lifetimes of human beings) or can be extrapolated
from the data using the extreme values. By a proper affine transformation this range
can be transformed to $[0, 1]$, which we shall assume here, for the simplicity of our
presentation. To represent this set of data in a convenient, simplified, but suggestive
way, it is common to use what is called a histogram. To design a histogram, one
first chooses some finite partition $m = \{I_0, \ldots, I_D\}$ ($D \in \mathbb{N}$) of $[0, 1]$ into intervals
$I_j$, generated by an increasing sequence of endpoints $y_0 = 0 < y_1 < \ldots < y_{D+1} = 1$
so that $I_j = [y_j, y_{j+1})$ for $0 \le j < D$ and $I_D = [y_D, y_{D+1}]$. Then, for each $j$, one
computes the number $n_j$ of observations falling in $I_j$ and one represents the data set
by the piecewise constant function $\hat s_m$ defined on $[0, 1]$ by

$$\hat s_m(x) = \sum_{j=0}^{D} \frac{n_j}{n|I_j|} 1_{I_j}(x) \quad\text{with}\quad n_j = \sum_{i=1}^{n} 1_{I_j}(x_i) \quad\text{and}\quad |I_j| = y_{j+1} - y_j. \qquad (1.1)$$

Any such histogram $\hat s_m$ provides a summary of the data with three obvious properties.
It is nonnegative; its integral is equal to one ($\int_0^1 \hat s_m(x)\,dx = 1$) and it belongs to the
$(D + 1)$-dimensional linear space $V_m$ of piecewise constant functions built on the
partition $m$, i.e.

$$V_m = \Big\{ t = \sum_{j=0}^{D} a_j 1_{I_j} \;\Big|\; a_0, \ldots, a_D \in \mathbb{R} \Big\}. \qquad (1.2)$$

If the points $y_j$ are equispaced, i.e. all intervals $I_j$ have the same length $(D + 1)^{-1}$,
the partition and the histogram are called regular. If $D \ge 1$ and the intervals do not
all have the same length, the partition is called irregular.
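
The construction (1.1) can be sketched in a few lines of Python. This is only an illustration: the function names `histogram` and `regular_endpoints` are ours, and the data are assumed to have already been rescaled to [0, 1].

```python
from bisect import bisect_right

def histogram(data, endpoints):
    """Heights n_j / (n * |I_j|) of the histogram (1.1).

    endpoints = [y_0, ..., y_{D+1}] with y_0 = 0 and y_{D+1} = 1;
    data are assumed to lie in [0, 1].
    """
    n = len(data)
    n_cells = len(endpoints) - 1              # D + 1 intervals
    counts = [0] * n_cells
    for x in data:
        # I_j = [y_j, y_{j+1}) except the last interval, which is closed.
        j = min(bisect_right(endpoints, x) - 1, n_cells - 1)
        counts[j] += 1
    return [counts[j] / (n * (endpoints[j + 1] - endpoints[j]))
            for j in range(n_cells)]

def regular_endpoints(D):
    """Endpoints of the regular partition with D + 1 intervals."""
    return [j / (D + 1) for j in range(D + 2)]
```

For instance, `histogram(data, regular_endpoints(9))` gives the ten heights of the regular histogram with D = 9; multiplying each height by the length of its interval and summing returns one, as required of a density.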

Even within this very elementary framework, some questions are in order: what is 
a "good" partition, i.e. how can one measure the quality of the representation of the 
data by a histogram, and how can one choose such a good partition? One can easily 
figure out that a partition with too few intervals, as compared with n, will lead to 
an uninformative representation. Alternatively, if there are too few data per interval 
the histogram may be quite erratic and meaningless. But these are purely qualitative 
properties which cannot lead to a sound criterion of quality for a partition which 
could be used to choose a proper one. 



1.2 Histograms as density estimators 
1.2.1 The stochastic point of view 

To go further with this analysis, we have to put the whole thing into a more
mathematical framework and a convenient one, for this type of problem, is of statistical
nature. In many situations, our data $x_i$ can be considered as successive observations
of some random phenomenon which means that $x_i = X_i(\omega)$ is the realization of a
random variable $X_i$ from some probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with values in $[0, 1]$ (with
its Borel $\sigma$-algebra). If we assume that the random phenomenon was stable during
the observation period and the measurements were done independently of each other,
the random variables $X_i$ can be considered as i.i.d. (independent and identically
distributed) with common distribution $Q$ so that

$$\mathbb{P}\left[\{\omega \in \Omega \,|\, X_1(\omega) \in A_1, \ldots, X_n(\omega) \in A_n\}\right] = \prod_{i=1}^{n} Q(A_i),$$

for any family of Borel sets $A_1, \ldots, A_n \subset [0,1]$. Such assumptions are justified (at
least approximately) in many practical situations and $(X_1, \ldots, X_n)$ is then called an
n-sample from the distribution $Q$.

With this new probabilistic interpretation, $\hat s_m = \hat s_m(x, \omega)$ becomes a random
function, more precisely a random element of $V_m$, and (1.1) becomes

$$\hat s_m(x, \omega) = \sum_{j=0}^{D} \frac{N_j(\omega)}{n|I_j|} 1_{I_j}(x) \quad\text{with}\quad N_j(\omega) = \sum_{i=1}^{n} 1_{I_j}(X_i(\omega)). \qquad (1.3)$$

From now on, following the probabilistic tradition, we shall, most of the time, omit
the variable $\omega$ when dealing with random elements.

It follows from (1.3) that the random variables $N_j$ are binomial random variables
with parameters $n$ and $p_j = Q(I_j)$ and, if we assume that $Q$ has a density $s$ with
respect to the Lebesgue measure on $[0, 1]$, then $p_j = \int_{I_j} s(x)\,dx$. If $s$ also belongs to
$\mathbb{L}_2([0,1], dx)$, the piecewise constant element $\bar s_m = \sum_{j=0}^{D} p_j |I_j|^{-1} 1_{I_j}$ of $\mathbb{L}_\infty([0,1], dx)$
is the orthogonal projection of $s$ onto $V_m$ and

$$\|\bar s_m\|_\infty = \sup_{0 \le j \le D} \frac{p_j}{|I_j|} \quad\text{and}\quad \|s - \hat s_m\|^2 = \|s - \bar s_m\|^2 + \|\bar s_m - \hat s_m\|^2, \qquad (1.4)$$

where $\|t\|$ denotes the $\mathbb{L}_2$-norm of $t$.



1.2.2 Density estimators and their risk 

From a practical point of view, even if it is reasonable to assume that the variables
$X_i$ are i.i.d. with distribution $Q$ and density $s = dQ/dx$, this distribution is typically
unknown, and so is its density. It is often useful, in order to have an idea of
the stochastic nature of the phenomenon that produced the data, to get as much
information as possible about the unknown density $s$. For instance, comparing the
shapes of lifetime densities among different populations or their evolution with time
brings much more information than merely comparing the corresponding expected
lifetimes. The very purpose of Statistics is to derive information about the
deterministic, but unknown, parameter $s$ from the stochastic, but observable, data $X_i(\omega)$. In
our problem, $\hat s_m$, which is a density, can be viewed as a random approximation of $s$
solely based on the available information provided by the sample $X_1, \ldots, X_n$, i.e., in
statistical language, an estimator of $s$. The distortion of the estimated density $\hat s_m$
from the true density $s$ can be measured by the quantity $\|s - \hat s_m\|^2$. It is clearly not
the only way but this one, as seen from (1.4), has the advantage of simplicity. Note
that $\|s - \hat s_m\|$ is a random quantity depending on $\omega$ as $\hat s_m$ does. In order to average
out this randomness, statisticians often consider, as a measure of the quality of
the estimator $\hat s_m$, its risk at $s$ which is the expectation of the distortion $\|s - \hat s_m\|^2$
given by

$$R(\hat s_m, s) = \mathbb{E}_s\left[\|s - \hat s_m\|^2\right] = \int \|s - \hat s_m(\omega)\|^2 \, d\mathbb{P}_s(\omega).$$

Here $\mathbb{P}_s$ and $\mathbb{E}_s$ respectively denote the probability and the expectation of functions of
$X_1, \ldots, X_n$ when these variables are i.i.d. with density $s$. Of course, due to
randomness, $R(\hat s_m, s)$ does not provide any information on the actual distortion $\|s - \hat s_m(\omega)\|^2$
in our experiment. But, by the law of large numbers, it provides a good approximation
of the average distortion one would get if one iterated many times the procedure
of drawing a sample $X_1, \ldots, X_n$ and building the corresponding histogram. The
importance of the risk, as a measure of the quality of the estimator $\hat s_m$, also derives from
Markov's inequality which implies that, for any $z > 0$,

$$\mathbb{P}_s\left[\|s - \hat s_m\| \ge \sqrt{zR(\hat s_m, s)}\right] \le z^{-1}. \qquad (1.5)$$

Hence, with a guaranteed probability $1 - z^{-1}$ the distance between $s$ and its estimator
is bounded by $\sqrt{zR(\hat s_m, s)}$. When $z$ is large, there are only two cases: either we were
very unlucky and an event of probability not larger than $z^{-1}$ occurred, or we were
not and $\|s - \hat s_m\| < \sqrt{zR(\hat s_m, s)}$. Of course, there is no way to know which of the
two cases occurred, but this is the rule in Statistics: there is always some uncertainty
in our conclusions.
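
The interpretation of the risk as an average over repeated experiments can be illustrated by a small simulation. The sketch below is only an illustration under assumptions of ours: the true density is taken to be the hypothetical $s(x) = 2x$ on [0, 1] (sampled as the square root of a uniform variable), the partition is regular, and all function names are made up for the example.

```python
import random

def hist_heights(data, D):
    """Heights of the regular histogram with D + 1 cells on [0, 1]."""
    counts = [0] * (D + 1)
    for x in data:
        counts[min(int(x * (D + 1)), D)] += 1
    return [c * (D + 1) / len(data) for c in counts]

def sq_distance_to_2x(heights):
    """Exact ||s - t||^2 for s(x) = 2x and t piecewise constant on regular cells."""
    k = len(heights)
    total = 0.0
    for j, h in enumerate(heights):
        a, b = j / k, (j + 1) / k
        # integral of (2x - h)^2 over [a, b]
        total += ((2 * b - h) ** 3 - (2 * a - h) ** 3) / 6.0
    return total

def monte_carlo_risk(n, D, n_rep, seed=0):
    """Average distortion over n_rep independent n-samples, plus the raw losses."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_rep):
        sample = [rng.random() ** 0.5 for _ in range(n)]   # density 2x on [0, 1]
        losses.append(sq_distance_to_2x(hist_heights(sample, D)))
    return sum(losses) / n_rep, losses
```

Calling `monte_carlo_risk(200, 4, 200)` returns an approximation of the risk together with the individual distortions; the proportion of distortions exceeding `z` times their average can then be compared with the bound $z^{-1}$ given by Markov's inequality.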
1.2.3 Risk bounds for histograms

In any case, (1.5) shows that the risk can be viewed as a good indicator of the
performance of an estimator. Moreover, it follows from (1.4) that it can be written
as

$$R(\hat s_m, s) = \|s - \bar s_m\|^2 + \mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right]. \qquad (1.6)$$

With this special choice of distortion, the risk can be decomposed into the sum of two
terms. The first one has nothing to do with the stochastic nature of the observations
but simply measures the quality of approximation of $s$ by the linear space $V_m$ since
it is the square of the distance from $s$ to $V_m$. It only depends on the partition and
the true unknown density $s$, not on the observations.

The second term in the risk, which is due to the stochastic nature of the
observations, hence of $\hat s_m$, can be computed in the following way, since $N_j$ is a binomial
random variable with parameters $n$ and $p_j$ and both $\hat s_m$ and $\bar s_m$ are constant on each
interval $I_j$:

$$\mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right] = \sum_{j=0}^{D} |I_j| \, \mathbb{E}_s\left[\left(\frac{N_j - np_j}{n|I_j|}\right)^2\right] = \frac{1}{n} \sum_{j=0}^{D} \frac{p_j(1 - p_j)}{|I_j|}. \qquad (1.7)$$

This quantity is easy to bound in the special case of a regular partition since then
$|I_j| = (D + 1)^{-1}$ and we get, using the concavity of the function $x \mapsto x(1 - x)$,

$$\mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right] = \frac{D+1}{n} \sum_{j=0}^{D} p_j(1 - p_j) \le \frac{D}{n}. \qquad (1.8)$$

Note that $D = 0$ corresponds to the degenerate partition $m_0 = \{[0,1]\}$ for which
$\hat s_{m_0} = 1_{[0,1]}$, which is the density of the uniform distribution on $[0, 1]$, independently
of $s$. Then $\bar s_{m_0} = 1_{[0,1]}$ and $R(\hat s_{m_0}, s) = \|s - 1_{[0,1]}\|^2$.

For general irregular partitions we derive from (1.4) that $p_j \le |I_j| \|\bar s_m\|_\infty$, hence,
by (1.7),

$$\mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right] \le \frac{\|\bar s_m\|_\infty}{n} \sum_{j=0}^{D} (1 - p_j) = \frac{\|\bar s_m\|_\infty D}{n}. \qquad (1.9)$$

There is actually little space for improvement in (1.9) as shown by the following
example. Define the partition $m$ by $I_j = [\alpha j, \alpha(j + 1))$ for $0 \le j < D$ and $I_D = [\alpha D, 1]$
with $0 < \alpha < D^{-1}$. Set $s = \bar s_m = (\alpha D)^{-1}(1 - 1_{I_D})$. Then $p_j = D^{-1}$ for $0 \le j < D$
and, by (1.7),

$$\mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right] = \frac{D-1}{\alpha n D} = \frac{(D-1)\|\bar s_m\|_\infty}{n}.$$
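
The estimation term is easy to evaluate numerically once the probabilities $p_j$ and the lengths $|I_j|$ are given. The sketch below assumes that this term equals $n^{-1}\sum_j p_j(1-p_j)/|I_j|$, which is what the binomial variance of the $N_j$ gives; the function name and the numerical values are ours.

```python
def estimation_term(p, lengths, n):
    """n^{-1} * sum_j p_j (1 - p_j) / |I_j|: the expectation of ||s_bar_m - s_hat_m||^2."""
    return sum(pj * (1.0 - pj) / lj for pj, lj in zip(p, lengths)) / n

# Irregular example of the text with D = 4 and alpha = 0.1 (so alpha < 1/D).
D, alpha, n = 4, 0.1, 100
lengths = [alpha] * D + [1.0 - alpha * D]
p = [1.0 / D] * D + [0.0]
sup_norm = 1.0 / (alpha * D)                   # ||s_bar_m||_infinity
exact = estimation_term(p, lengths, n)         # equals (D - 1) * sup_norm / n
upper = D * sup_norm / n                       # right-hand side of (1.9)
```

With these values the exact term is 0.075 and the bound is 0.1, so the ratio is $(D-1)/D$. For a regular partition and the uniform density, the same function returns exactly $D/n$.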
If we make the extra assumption that $s$ belongs to $\mathbb{L}_\infty([0,1], dx)$, then $\|\bar s_m\|_\infty \le \|s\|_\infty$
and (1.9) becomes $\mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right] \le \|s\|_\infty n^{-1} D$. This bound is also valid for regular
partitions but always worse than (1.8) since $\|s\|_\infty \ge 1$ for all densities with respect to
Lebesgue measure on $[0, 1]$ and strictly worse if $s$ is not the uniform density. Finally,
by (1.6),

$$R(\hat s_m, s) \le \|s - \bar s_m\|^2 + \|s\|_\infty n^{-1} D. \qquad (1.10)$$

As we shall see later, the rather unpleasant presence of the unknown and possibly
unbounded $\|s\|_\infty$ factor in the second term is due to the way we measure the distance
between densities, i.e. through the $\mathbb{L}_2$-norm.

1.3 A first approach to model selection 

1.3.1 An alternative interpretation of histograms 

The decomposition (1.4) suggests another interpretation for the construction of $\hat s_m$.
What do we do here? Since $s$ is possibly a complicated object, we replace it by a
much simpler one $\bar s_m$ and estimate it by $\hat s_m$. Note that $\bar s_m$ is unknown, as $s$ is, and
what is available to the statistician is the partition $m$, the corresponding linear space
$V_m$ and, consequently, the set $S_m$ of all densities belonging to $V_m$, i.e.

$$S_m = \Big\{ t = \sum_{j=0}^{D} a_j 1_{I_j}(x) \;\Big|\; a_j \ge 0 \text{ for } 0 \le j \le D \text{ and } \sum_{j=0}^{D} a_j |I_j| = 1 \Big\}. \qquad (1.11)$$

It is a convex subset of some $D$-dimensional affine space and $\bar s_m$ is given by $\|s - \bar s_m\| =
\inf_{t \in S_m} \|s - t\|$. It is the best approximation of $s$ in $S_m$. As to $\hat s_m$, it only depends on
the set $S_m$ and the observations in the following way, as can easily be checked:

$$\hat s_m = \operatorname*{argmax}_{t \in S_m} \sum_{i=1}^{n} \log(t(X_i)),$$

which means that it maximizes the so-called likelihood function $t \mapsto \prod_{i=1}^{n} t(X_i)$ for
$t \in S_m$, the likelihood at $t$ being the joint density of the sample computed at the
observations. The estimator $\hat s_m$ is called the maximum likelihood estimator (m.l.e. for
short) with respect to $S_m$. Note that, if $s = \bar s_m$ actually belongs to $S_m$, the m.l.e.
converges in probability to $s$ at rate at least as fast as $n^{-1/2}$ when $n$ goes to infinity
since then, by (1.5), (1.6) and (1.9),

$$\mathbb{P}_s\left[\|s - \hat s_m\| \ge \sqrt{z \|s\|_\infty D / n}\right] \le z^{-1} \quad\text{for all } z > 0.$$

The m.l.e. therefore appears to be a suitable estimator to use if the model $S_m$ is
correct, i.e. if $s \in S_m$. When we use the histogram estimator $\hat s_m$, we just do as if $s$
did belong to $S_m$, using $S_m$ as an approximate model for $s$. The resulting risk is then
the sum of two terms, an approximation error equal to the square of the distance
from $s$ to $S_m$ and due to the fact that $s$ does not in general belong to the model
$S_m$, and an estimation term $\mathbb{E}_s\left[\|\bar s_m - \hat s_m\|^2\right]$ which is the risk corresponding to the
estimation within the model when $s = \bar s_m$ since $\|\bar s_m - \hat s_m\|^2$ has the same expectation
when the observations are i.i.d. with density $s$ or $\bar s_m$.



1.3.2 Model selection and oracles 

Let us denote by $m_D$ the regular partition with $D + 1$ pieces and set $S_D = S_{m_D}$,
$\hat s_D = \hat s_{m_D}$ and $\bar s_D = \bar s_{m_D}$, for simplicity. It follows from (1.6) and (1.8) that

$$R(\hat s_D, s) \le \|s - \bar s_D\|^2 + n^{-1} D. \qquad (1.12)$$

From the approximation point of view, a good partition should lead to a small value of
$\|s - \bar s_D\|$ which typically requires a partition into many intervals, hence a large value of
$D$, while the estimation point of view requires a model $S_D$ defined by few parameters,
hence a small value of $D$. Obviously, these requirements are contradictory and one
should look for a compromise between them in order to minimize the right-hand side
of (1.12). Unfortunately, the value $D_{opt}$ which satisfies

$$\|s - \bar s_{D_{opt}}\|^2 + n^{-1} D_{opt} = \inf_{D \in \mathbb{N}} \left\{\|s - \bar s_D\|^2 + n^{-1} D\right\}$$

cannot be computed since it depends on the unknown density $s$ via the approximation
term $\|s - \bar s_D\|$ and is not accessible to the statistician. This is why the random variable
$\hat s_{D_{opt}}$ based on the partition $m_{D_{opt}}$ is called an "oracle". It is not an estimator because
it makes use of the number $D_{opt}$ which is unknown to the statistician. The problem of
model selection is to find a genuine estimator, solely based on the data, that mimics
an oracle, i.e. to use the data $X_1, \ldots, X_n$ to select a number $\hat D(X_1, \ldots, X_n)$ such that
the resulting histogram $\tilde s = \hat s_{\hat D}$ has a performance which is comparable to that of the
oracle:

$$R(\tilde s, s) \le C\left[\|s - \bar s_{D_{opt}}\|^2 + n^{-1} D_{opt}\right],$$

where $C$ is a constant that depends neither on the unknown density $s$ nor on $n$.
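
To see what the oracle looks like when the density is known, one can minimize the bound numerically. The following sketch is only illustrative and rests on assumptions of ours: the density is the hypothetical $s(x) = 2x$, for which a direct computation gives $\|s - \bar s_D\|^2 = 1/(3(D+1)^2)$ on the regular partition, and the search is restricted to $D \le D_{max}$.

```python
def oracle_dimension(n, sq_bias, D_max):
    """Return the D in {0, ..., D_max} minimizing sq_bias(D) + D / n.

    sq_bias(D) stands for ||s - s_bar_D||^2, which requires knowing s:
    this is exactly why the result is an oracle and not an estimator.
    """
    return min(range(D_max + 1), key=lambda D: sq_bias(D) + D / n)

def sq_bias_2x(D):
    """||s - s_bar_D||^2 for the hypothetical density s(x) = 2x on [0, 1]."""
    return 1.0 / (3.0 * (D + 1) ** 2)
```

For this density, `oracle_dimension(1000, sq_bias_2x, 100)` selects a partition with nine intervals (D = 8). A statistician who does not know $s$ cannot evaluate `sq_bias`, and a model selection procedure has to replace it by a quantity computed from the data.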



1.3.3 An illustrative example 

Still working with the regular partitions $m_D$, let us now assume that the unknown
density $s$ satisfies some Hölderian continuity condition,

$$|s(x) - s(y)| \le L|x - y|^\beta, \quad L > 0, \; 0 < \beta \le 1 \quad\text{for all } x, y \in [0, 1]. \qquad (1.13)$$

If $0 \le j \le D$ and $x \in I_j$, then $\bar s_D(x) = s(y)$ for some $y \in I_j$, hence $|s(x) - \bar s_D(x)| \le
L(D + 1)^{-\beta}$, from which we derive that $\|s - \bar s_D\|^2 \le \|s - \bar s_D\|_\infty^2 \le L^2(D + 1)^{-2\beta}$.
Therefore (1.12) implies that $R(\hat s_D, s) \le n^{-1} D + L^2(D + 1)^{-2\beta}$. Since the minimum
of the function $x \mapsto n^{-1} x + L^2 x^{-2\beta}$ is obtained for $x = (2\beta n L^2)^{1/(2\beta+1)}$, we choose $D$
so that $D + 1$ is the smallest integer $\ge (nL^2)^{1/(2\beta+1)}$. If $nL^2 \le 1$, this leads to $D = 0$
and $R(\hat s_0, s) \le L^2 \le n^{-1}$. Otherwise, and this necessarily happens for large enough
$n$, $1 \le D < (nL^2)^{1/(2\beta+1)}$, hence $R(\hat s_D, s) \le 2\left(Ln^{-\beta}\right)^{2/(2\beta+1)}$. Finally, in any case,

$$R(\hat s_D, s) \le \max\left\{2\left(Ln^{-\beta}\right)^{2/(2\beta+1)} ; \, n^{-1}\right\}.$$

Unfortunately, we can only get a risk bound of this form if we fix $D$ as a function of
$L$ and $\beta$, as indicated above. Typically, $L$ and $\beta$ are also unknown so that we do not
know how to choose $D$ and cannot get the right risk bound. The situation is even
more complicated since, for a given $s$, there are many different pairs $L, \beta$ that satisfy
(1.13), leading to different values of $D$ and risk bounds. Of course, one would like
to choose the optimal one which means choosing the value of $D$ that minimizes the
right-hand side of (1.12).
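
The choice of $D$ described above, which requires knowing $L$ and $\beta$, can be written down directly. This is a minimal sketch with function names of our own; it only evaluates the upper bound $n^{-1}D + L^2(D+1)^{-2\beta}$ derived from (1.12), not the actual risk.

```python
import math

def choose_D(n, L, beta):
    """D such that D + 1 is the smallest integer >= (n L^2)^(1 / (2 beta + 1))."""
    x = (n * L * L) ** (1.0 / (2.0 * beta + 1.0))
    return max(math.ceil(x), 1) - 1

def risk_bound(n, L, beta):
    """Upper bound n^{-1} D + L^2 (D + 1)^{-2 beta} for the D chosen above."""
    D = choose_D(n, L, beta)
    return D / n + L * L * (D + 1) ** (-2.0 * beta)
```

For example, with n = 100, L = 1 and beta = 1/2 the rule gives D = 9 and a bound of 0.19, below the value $2(Ln^{-\beta})^{2/(2\beta+1)} = 0.2$. Changing the assumed pair (L, beta) changes the selected D, which illustrates why an unknown smoothness is a real obstacle.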
1.4 A brief summary of this paper

The study of histograms as density estimators shows us that a convenient method
to estimate a complicated object such as a density $s$ on $[0, 1]$ works as follows: choose an
approximate model $S_m$ for $s$ involving only a limited number of unknown parameters
and then do as if the model were correct, i.e. as if $s \in S_m$, using an estimator $\hat s_m$ which
is a good estimator when the model is actually correct. The resulting risk is the
sum of an approximation term which measures the quality of approximation of s by 
the model and an estimation term which is roughly proportional to the number of 
parameters needed to describe an element of the model, reflecting its complexity. 
As a consequence, a good model should be simple (described by few parameters) 
and accurate (close to the true density s). Unfortunately, because of the second 
requirement, a theoretical choice of a good model should be based on the knowledge 
of s. Given a family of possible models, a major problem is therefore to understand 
to what extent one can guess from the data which model in the family is appropriate. 

The remainder of this paper is devoted to giving some hints to justify and under- 
stand the various steps needed to formally develop the previous arguments. The next 
section will present the classical parametric theory of estimation which assumes that 
one works with the correct model and that this model satisfies some specific regularity 
conditions. Under such conditions the m.l.e. enjoys some good asymptotic properties 
that we shall recall, but this classical theory does not handle the case of approximate 
models or infinite dimensional parameters. It has therefore been extended in the 
recent years in many directions to (partly) cover such situations. We shall present 
here one such generalization that attempts to solve (at least theoretically) most of the 
difficulties connected with the classical theory. In Section 3 we shall depart from the
classical theory, assuming only an approximate model and checking on some examples 
that the results we got for histograms essentially extend to these cases with a risk 
bounded by an approximation term plus an estimation term which again leads to the 
problem of selecting a good model. Section 4 is devoted to a more general approach
to estimation based on an approximate model with finite dimension for a suitably 
defined and purely metric notion of dimension. We show here that some specific es- 
timators (sometimes discretized versions of the m.l.e., sometimes more complicated 
ones) do lead to risk bounds of the required form: an approximation term plus an 
estimation term which is proportional to the dimension (when suitably defined) of 
the model. In the last section, we explain how to handle many such approximate 
models with finite dimensions simultaneously. Ideally, we would like to choose, using 
only the data, the best model in the family, i.e. the one with the smallest risk. This is 
unfortunately not possible, but we shall explain to what extent one can approximate 
this ideal risk. 

2 Some historical considerations 
2.1 The classical parametric point of view 

To be specific, let us assume again that our observations $X_1, \ldots, X_n$ are i.i.d. random
variables with an unknown density $s$ with respect to some reference measure $\nu$ defined
on the underlying measurable set $(E, \mathcal{E})$ (not necessarily the Lebesgue measure on
$[0, 1]$) so that the joint distribution $\mathbb{P}_s$ of the observations on $E^n$ is given by

$$\frac{d\mathbb{P}_s}{d\nu^{\otimes n}}(x_1, \ldots, x_n) = \prod_{i=1}^{n} s(x_i).$$

In the sequel, we shall call the problem of estimating the unknown density $s$ from the
i.i.d. sample $X_1, \ldots, X_n$ the density estimation problem or the i.i.d. framework.

The classical parametric approach to density estimation that developed after
milestone papers by Fisher (1921 and 1925) up to the sixties and is still quite popular
nowadays is somewhat different from what we described before. It typically assumes
a parametric model $S$ for $s$, which means that the true unknown density $s$ of our
observations belongs to some particular set $S = \{t_\theta \,|\, \theta \in \Theta\}$ of densities parametrized
by some subset $\Theta$ of a Euclidean space $\mathbb{R}^k$. Then $s = t_{\theta_0}$ for some particular $\theta_0 \in \Theta$
which is called the true parameter value. One assumes moreover that the mapping
$\theta \mapsto t_\theta$ from $\Theta$ to $S$ is smooth (in a suitable sense) and one-to-one, so that estimating
$s$ is equivalent to estimating the parameter $\theta_0$. An estimator $\hat\theta_n(X_1, \ldots, X_n)$ of $\theta_0$ is
then defined via a measurable mapping $\hat\theta_n$ from $E^n$ to $\Theta$ (with its Borel $\sigma$-algebra)
and its quadratic risk is given by

$$R(\hat\theta_n, \theta_0) = \mathbb{E}_s\left[\|\hat\theta_n - \theta_0\|^2\right],$$

where $\|\cdot\|$ now denotes the Euclidean norm in $\mathbb{R}^k$. Typical examples of parametric
models for densities on the real line are given by

i) the Gaussian densities $\mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$ and $\Theta = \mathbb{R} \times (0, +\infty)$ given
by

$$t_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right];$$

ii) the gamma densities $\Gamma(v, \lambda)$ with $\theta = (v, \lambda)$ and $\Theta = (0, +\infty)^2$ given by

$$t_\theta(x) = [\Gamma(v)]^{-1} \lambda^v x^{v-1} \exp[-\lambda x] \quad\text{for } x > 0;$$

iii) the uniform density on the interval $[\theta, \theta + 1]$ given by $t_\theta = 1_{[\theta, \theta+1]}$ with $\theta \in \Theta = \mathbb{R}$.


2.2 The maximum likelihood method 

2.2.1 Consistency and asymptotic normality of the parametric m.l.e. 

Fisher's approach to parametric estimation is mainly connected with the method of
maximum likelihood. We recall from Section 1.3.1 that the likelihood function on $\Theta$
is given by $\theta \mapsto \prod_{i=1}^{n} t_\theta(X_i)$ and a maximum likelihood estimator $\hat\theta_n$ is any maximizer
of this function or equivalently of the log-likelihood function

$$L(\theta) = \sum_{i=1}^{n} \log\left(t_\theta(X_i)\right).$$

For Gaussian densities, the maximum likelihood estimator $\hat\theta_n = (\hat\mu_n, \hat\sigma_n^2)$ is unique and
given by $\hat\mu_n = n^{-1} \sum_{i=1}^{n} X_i$ and $\hat\sigma_n^2 = n^{-1} \sum_{i=1}^{n} (X_i - \hat\mu_n)^2$. Moreover $\hat\theta_n$ converges
in probability to the true parameter $\theta_0$ when $n$ goes to infinity. We say that $\hat\theta_n$ is
consistent. Unfortunately, this situation is not general. The study of our second and
third examples shows that explicit computation of the m.l.e. is not always possible
(gamma densities) or the m.l.e. may not be unique (uniform densities). One can also
find examples of inconsistency of the m.l.e., but, as shown by Wald (1949), it can be
proved that, under suitably strong assumptions, any sequence of maximum likelihood
estimators is consistent.
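
For the Gaussian model the explicit formulas translate into code directly. The sketch below is only an illustration; the function names are ours, and the log-likelihood is included so that one can check numerically that moving away from the computed values lowers it.

```python
import math

def gaussian_mle(xs):
    """Maximum likelihood estimator (mu_hat, sigma2_hat) in the Gaussian model."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n    # divides by n, not n - 1
    return mu, sigma2

def gaussian_log_likelihood(xs, mu, sigma2):
    """L(theta) = sum_i log t_theta(X_i) for theta = (mu, sigma2)."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma2))
```

For the sample `[1.0, 2.0, 3.0, 4.0]` the estimator is `(2.5, 1.25)`. No such closed form exists for the gamma model, where the maximization has to be carried out numerically.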

If the mapping $\theta \mapsto l_\theta(x) = \log(t_\theta(x))$ satisfies suitable differentiability
assumptions, the parametric model is called regular. This is the case for the Gaussian and
gamma densities, not for the uniform. If the model is regular and the m.l.e. is
consistent we can expand the derivative of the function $L$ in a vicinity of $\theta_0$ when it is
an inner point of $\Theta$. Restricting ourselves, for simplicity, to the case $\Theta \subset \mathbb{R}$, we get

$$L'(\theta) = L'(\theta_0) + (\theta - \theta_0) L''(\theta_0) + (1/2)(\theta - \theta_0)^2 L'''(\theta')$$

for some $\theta'$ between $\theta_0$ and $\theta$ and, since $\hat\theta_n$ is a maximizer for $L$,

$$L'(\hat\theta_n) = 0 = L'(\theta_0) + (\hat\theta_n - \theta_0) L''(\theta_0) + (1/2)(\hat\theta_n - \theta_0)^2 L'''(\theta'_n),$$

for some sequence $(\theta'_n)$ converging to $\theta_0$ in probability as $\hat\theta_n$ does. Equivalently,
setting $\delta_n = \sqrt{n}\,(\hat\theta_n - \theta_0)$,



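$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} l'_{\theta_0}(X_i) + \left[\frac{1}{n} \sum_{i=1}^{n} l''_{\theta_0}(X_i)\right] \delta_n + \frac{\delta_n^2}{2\sqrt{n}} \left[\frac{1}{n} \sum_{i=1}^{n} l'''_{\theta'_n}(X_i)\right] = 0. \qquad (2.1)$$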



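Since $\int t_\theta(x) \, d\nu(x) = 1$ for all $\theta$, it follows from the regularity assumptions that

$$\mathbb{E}_s\left[l'_{\theta_0}(X_i)\right] = \int \left[t'_{\theta_0}(x)/t_{\theta_0}(x)\right] t_{\theta_0}(x) \, d\nu(x) = \int t'_{\theta_0}(x) \, d\nu(x) = 0$$

and

$$\mathbb{E}_s\left[l''_{\theta_0}(X_i)\right] = \int \left[t''_{\theta_0}(x)/t_{\theta_0}(x)\right] t_{\theta_0}(x) \, d\nu(x) - \int \left[t'_{\theta_0}(x)/t_{\theta_0}(x)\right]^2 t_{\theta_0}(x) \, d\nu(x)$$
$$= 0 - \int \left(\left[t'_{\theta_0}(x)\right]^2 / t_{\theta_0}(x)\right) d\nu(x) = -I(\theta_0),$$

where the last equality defines the Fisher Information $I(\theta_0)$. Moreover

$$\mathrm{Var}\left[l'_{\theta_0}(X_i)\right] = \mathbb{E}_s\left[\left(l'_{\theta_0}(X_i)\right)^2\right] = \int \left[t'_{\theta_0}(x)/t_{\theta_0}(x)\right]^2 t_{\theta_0}(x) \, d\nu(x) = I(\theta_0).$$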

It then follows from the law of large numbers that

$$\frac{1}{n} \sum_{i=1}^{n} l''_{\theta_0}(X_i) \xrightarrow{P} \mathbb{E}_s\left[l''_{\theta_0}(X_i)\right] = -I(\theta_0)$$

and from the central limit theorem that

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} l'_{\theta_0}(X_i) \rightsquigarrow \mathcal{N}\left(0, I(\theta_0)\right),$$

where $\xrightarrow{P}$ and $\rightsquigarrow$ denote respectively the convergences in probability and in
distribution. The regularity assumptions also ensure that $n^{-1} \sum_{i=1}^{n} l'''_{\theta'_n}(X_i)$ is asymptotically
bounded so that the third term in (2.1) is asymptotically negligible as compared to
the other two. We finally deduce from (2.1) that

$$\delta_n = \sqrt{n}\,(\hat\theta_n - \theta_0) \rightsquigarrow \mathcal{N}\left(0, [I(\theta_0)]^{-1}\right). \qquad (2.2)$$

This is the so-called asymptotic normality and efficiency of the maximum likelihood
estimator and a formal proof of this result can be found in Cramér (1946,
Section 33.3). It can also be proved that the asymptotic variance $[I(\theta_0)]^{-1}$ of $\delta_n$ is,
in various senses, optimal, as shown by Le Cam (1953) and Hájek (1970 and 1972).
Much less restrictive conditions of regularity which still imply the asymptotic
normality and efficiency of the m.l.e. have been given by Le Cam (1970) (see also
Theorem 12.3 in van der Vaart (2002)). A good account of the theory can be found
in Ibragimov and Has'minskii (1981). A more recent point of view on the theory of
regularity and the m.l.e., based on empirical process theory, is to be found in van der
Vaart (1998).

2.2.2 A more general point of view on the maximum likelihood method 

The limitations of the classical parametric theory of maximum likelihood have been
recognized for a long time. We already mentioned problems of inconsistency.
Examples and further references can be found in Le Cam (1990). Moreover, although it is
widely believed among non-specialists that (2.2) typically holds, this is definitely not
true, even under consistency. For instance, if $t_\theta = \theta^{-1} 1_{[0,\theta]}$ is the uniform density on
$[0, \theta]$ and $\Theta = (0, +\infty)$, the m.l.e. satisfies $n(\theta_0 - \hat\theta_n) \rightsquigarrow \Gamma(1, \theta_0^{-1})$. Additional examples
can be found in Ibragimov and Has'minskii (1981, Chapters 5 and 6) showing that
neither the rate $\sqrt{n}$ nor the limiting normal distribution is general.

Another drawback of the classical point of view on maximum likelihood estimation
is its purely asymptotic nature. Not only does it require specific assumptions and can
fail under small departures from these assumptions but it tells us nothing about the
real performances of the m.l.e. for a given (even large) number $n$ of observations,
just as the central limit theorem does. Suppose that our observations $X_1, \ldots, X_n$ are
i.i.d. Bernoulli variables taking only the values 0 and 1 with respective probabilities
$1 - \theta_0$ and $\theta_0$ and $\Theta = [0, 1]$. Then $\hat\theta_n = n^{-1} \sum_{i=1}^{n} X_i$ and, if $0 < \theta_0 < 1$, $\delta_n =
\sqrt{n}\,(\hat\theta_n - \theta_0) \rightsquigarrow \mathcal{N}(0, \theta_0[1 - \theta_0])$ as expected. But it is well-known that if $n = 1000$
and $0 < \theta_0 \le 0.002$, the distribution of $\hat\theta_n$ looks (up to the scaling factor $n$) rather like a
Poisson distribution with parameter $1000\,\theta_0$ than like a normal $\mathcal{N}(\theta_0, n^{-1}\theta_0[1 - \theta_0])$ as predicted by the
asymptotic theory. A discussion about the relevance of the asymptotic point of view
for practical purposes can be found in Le Cam and Yang (2000, Section 7.1).
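
The Bernoulli example can be checked numerically by comparing the exact binomial distribution of the count $n\hat\theta_n$ with its Poisson and normal approximations. The sketch below is an illustration only: the function names are ours, the sums are truncated at `k_max`, and the normal approximation is discretized by assigning to each integer $k$ the normal mass of $[k - 1/2, k + 1/2)$.

```python
import math

def binom_pmf(k, n, theta):
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def normal_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def tv_distances(n, theta, k_max=60):
    """Approximate total variation distances of Binomial(n, theta) to its
    Poisson(n theta) and discretized normal approximations."""
    lam, var = n * theta, n * theta * (1.0 - theta)
    tv_poisson = tv_normal = 0.0
    for k in range(k_max + 1):
        b = binom_pmf(k, n, theta)
        tv_poisson += abs(b - poisson_pmf(k, lam))
        g = normal_cdf(k + 0.5, lam, var) - normal_cdf(k - 0.5, lam, var)
        tv_normal += abs(b - g)
    # normal mass below -1/2 has no binomial counterpart
    tv_normal += normal_cdf(-0.5, lam, var)
    return 0.5 * tv_poisson, 0.5 * tv_normal
```

With `tv_distances(1000, 0.002)` the distance to the Poisson distribution is far smaller than the distance to the normal one, in line with the remark above.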

A further limitation of the classical m.l.e. theory is the fact that the assumed
parametric model is true, i.e. the unknown distribution of the observations has a
density $s$ with respect to $\nu$ which is of the form $t_{\theta_0}$ for some $\theta_0 \in \Theta$. If this assumption
is violated, even slightly, the whole theory fails as can be seen from the following
example. We assume a Gaussian distribution $P_\theta = \mathcal{N}(\theta, 1)$ with density $t_\theta$ with
respect to the Lebesgue measure and $\Theta = \mathbb{R}$ but the observations actually follow the
distribution $Q = (99P_0 + P_{300})/100$. It is actually rather close to the $P_0$ distribution,
which belongs to the model, in the sense that, for any measurable set $A$, $|Q(A) -
P_0(A)| \le 1/100$. Nevertheless, the m.l.e. $\hat\theta_n = n^{-1} \sum_{i=1}^{n} X_i$ converges to 3 so that the
estimated distribution based on the wrong model will be close to $P_3$, hence quite
different from the true distribution which is close to $P_0$.
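
This lack of robustness is easy to reproduce by simulation. The sketch below is illustrative: the function names, the seed and the sample size are arbitrary choices of ours.

```python
import random

def sample_contaminated(n, seed=0):
    """Draw n observations from Q = (99 P_0 + P_300) / 100, with P_theta = N(theta, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(300.0 if rng.random() < 0.01 else 0.0, 1.0)
            for _ in range(n)]

def mle_gaussian_mean(xs):
    """M.l.e. of theta in the model {N(theta, 1)}: the empirical mean."""
    return sum(xs) / len(xs)

def fraction_near_zero(xs, radius=5.0):
    """Proportion of observations within `radius` of 0."""
    return sum(1 for x in xs if abs(x) < radius) / len(xs)
```

With a large sample from `sample_contaminated`, `mle_gaussian_mean` is close to 3 although about 99% of the observations lie within a few units of 0, so that the fitted distribution $P_3$ describes almost none of the data.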
For all these reasons, the classical approach to maximum likelihood estimation
has been substantially generalized in recent years. Nonparametric and semiparametric
maximum likelihood allow one to deal with families of distributions $\mathbb{P}_s$ where $s$
belongs to some infinite-dimensional set, while sieved m.l.e. involves situations where
the true distribution does not belong to the model. Both extensions lead to truly
nonasymptotic results. Among the many papers dealing with such extensions, let us
mention here Grenander (1981), Silverman (1982), Wahba (1990), Groeneboom and
Wellner (1992), van de Geer (1993, 1995 and 2000), Birgé and Massart (1993 and
1998), Shen and Wong (1994), Wong and Shen (1995), van der Vaart and Wellner
(1996), Barron, Birgé and Massart (1999) and Massart (2006). Let us now explain
what novelties are brought by some of these extensions.

3 An alternative point of view 
3.1 Nonparametric density estimation 

The assumption that the unknown density $s$ of the observations belongs to a
parametric model, i.e. a smooth image of some subset of a Euclidean space, appears to
be definitely too strong and unsatisfactory in many situations. Let us give here two
illustrations. If we assume that $s$ belongs to the set $S_1$ of Lipschitz densities on $[0, 1]$
(i.e. $s$ satisfies $|s(x) - s(y)| \le |x - y|$), one cannot represent $S_1$ in a smooth way
by a finite number of real parameters. The same holds if we simply assume that
$s \in S_2$, the set of all densities in $\mathbb{L}_2([0,1], dx)$. In this case, given some orthonormal
basis $(\varphi_j)_{j \ge 1}$ of $\mathbb{L}_2([0,1], dx)$, there exists a natural parametrization of $S_2$ by $\ell_2(\mathbb{N}^\star)$
($\mathbb{N}^\star = \mathbb{N} \setminus \{0\}$) via the coordinates, but it is definitely not finite-dimensional. These
two problems are examples of nonparametric density estimation problems.

3.1.1 Projection estimators 

In order to solve the second estimation problem, Čencov (1962) proposed a general
class of estimators called projection estimators. The idea is to estimate the coefficients
$s_j$ of $s$ in the orthonormal expansion $s = \sum_{j \ge 1} s_j \varphi_j$ using estimators $\hat s_j$ chosen in
such a way that $\sum_{j \ge 1} \hat s_j^2 < +\infty$ a.s. so that $\hat s = \sum_{j \ge 1} \hat s_j \varphi_j$ belongs to $\mathbb{L}_2([0,1], dx)$
a.s. Since $s_j = \int_0^1 s(x)\varphi_j(x) \, dx = \mathbb{E}_s[\varphi_j(X_i)]$, a natural estimator for $s_j$ is $\hat\varphi_j =
n^{-1} \sum_{i=1}^{n} \varphi_j(X_i)$. Indeed

$$\mathbb{E}_s[\hat\varphi_j] = s_j \quad\text{and}\quad \mathrm{Var}(\hat\varphi_j) = n^{-1} \mathrm{Var}(\varphi_j(X_1)) \le n^{-1} \int_0^1 \varphi_j^2(x) s(x) \, dx. \qquad (3.1)$$

Assuming, for simplicity, that we take for $(\varphi_j)_{j \ge 1}$ the trigonometric basis which is
bounded by $\sqrt{2}$, we derive that $\mathrm{Var}(\hat\varphi_j) \le 2/n$. We cannot use $\sum_{j=1}^{\infty} \hat\varphi_j \varphi_j$ as an
estimator of $s$ because the series does not converge. This is actually not surprising
because we are trying to estimate infinitely many parameters (the $s_j$) from a finite
number of observations. But, for any finite subset $m$ of $\mathbb{N}^\star$, the estimator $\hat s_m =
\sum_{j \in m} \hat\varphi_j \varphi_j$ does belong to $\mathbb{L}_2([0,1], dx)$ and

$$\|\hat s_m - s\|^2 = \sum_{j \in m} (\hat\varphi_j - s_j)^2 + \sum_{j \notin m} s_j^2.$$

If we denote by $|m|$ the cardinality of $m$, we conclude from (3.1) that

$$\mathbb{E}_s\left[\|\hat s_m - s\|^2\right] \le 2n^{-1}|m| + \|\bar s_m - s\|^2 \quad\text{with}\quad \bar s_m = \sum_{j \in m} s_j \varphi_j. \qquad (3.2)$$

Note that $\hat s_m$ is not necessarily a genuine estimator, i.e. a density, but this is a minor
point since $S_2$ is a closed convex subset of $\mathbb{L}_2([0,1], dx)$ on which we may always
project $\hat s_m$, getting a genuine estimator which is even closer to $s$ than $\hat s_m$.
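
A projection estimator on the trigonometric basis takes only a few lines. This is a minimal sketch under assumptions of ours: the basis is indexed by $j \ge 1$ with $\varphi_1 = 1$, $\varphi_{2k} = \sqrt{2}\cos(2\pi k x)$ and $\varphi_{2k+1} = \sqrt{2}\sin(2\pi k x)$ (one common convention among several), and the function names are made up for the example.

```python
import math

def phi(j, x):
    """Trigonometric basis of L2([0, 1]), indexed by j >= 1 and bounded by sqrt(2)."""
    if j == 1:
        return 1.0
    k = j // 2
    if j % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * k * x)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * x)

def projection_coefficients(sample, m):
    """Empirical coefficients n^{-1} sum_i phi_j(X_i) for j in the finite set m."""
    n = len(sample)
    return {j: sum(phi(j, x) for x in sample) / n for j in m}

def projection_estimator(sample, m):
    """Return the function x -> sum_{j in m} phi_hat_j * phi_j(x)."""
    coef = projection_coefficients(sample, m)
    return lambda x: sum(c * phi(j, x) for j, c in coef.items())
```

The returned function need not be nonnegative, which is the point made above about projection estimators not always being genuine densities. Its integral over [0, 1] equals one as soon as $1 \in m$, since the other basis functions integrate to zero.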



3.1.2 Approximate models for nonparametric estimation 

The construction of the projection estimator $\hat s_m$ can also be interpreted in terms of
a model since it is actually based on the parametric model

$$S_m = \Big\{ t = \sum_{j \in m} t_j \varphi_j, \; t_j \in \mathbb{R} \text{ for } j \in m \Big\}.$$

To build $\hat s_m$, we proceed as if $s$ did belong to $S_m$, estimating the $|m|$ unknown
parameters $s_j$ for $j \in m$ by their natural estimators $\hat\varphi_j$. But there are three main
differences with the classical parametric approach:

i) we do not assume that $s \in S_m$ so that $S_m$ is an approximate model for the true
density;

ii) apart from some exceptional cases, like histogram estimation, projection
estimators are not maximum likelihood estimators with respect to $S_m$;

iii) there is no asymptotic point of view here and the risk bound (3.2) is valid for
any value of $n$.

The histogram estimator can actually be viewed as a particular projection estimator.
With the notation of Section 1, we set $\varphi_j = |I_j|^{-1/2}\mathbb{1}_{I_j}$ for $1\le j\le D$, we
complete this orthonormal family into a basis of $\mathbb{L}_2([0,1],dx)$ and take for $m$ the set
$\{1,\ldots,D\}$. Then, for $j\in m$,

$$\overline{\varphi}_j = n^{-1}\sum_{i=1}^n |I_j|^{-1/2}\mathbb{1}_{I_j}(X_i) = \frac{n_j}{n}\,|I_j|^{-1/2} \quad\text{and}\quad \sum_{j\in m}\overline{\varphi}_j\varphi_j = \hat{s}_m.$$
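
As a small check of this identity, the sketch below computes the histogram heights through the projection coefficients and recovers the usual value $n_j/(n|I_j|)$. The function name and the convention that only the last interval is closed on the right are assumptions made for the illustration.

```python
def histogram_via_projection(sample, endpoints):
    """Histogram heights obtained as projection coefficients times basis functions.

    endpoints: increasing list from 0 to 1 defining the intervals
    [y_j, y_{j+1}), the last one being closed on the right (assumed convention).
    """
    n = len(sample)
    heights = []
    last = len(endpoints) - 2
    for j in range(len(endpoints) - 1):
        left, right = endpoints[j], endpoints[j + 1]
        # phi_j = |I_j|^{-1/2} times the indicator of I_j
        scale = (right - left) ** -0.5
        count = sum(1 for x in sample
                    if left <= x < right or (j == last and x == right))
        coeff = scale * count / n       # empirical mean of phi_j(X_i)
        heights.append(coeff * scale)   # value of coeff * phi_j on I_j
    return heights
```

Each height equals the proportion of observations in the interval divided by its length, so the heights times the interval lengths sum to one.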



3.2 Approximate models for parametric estimation 
3.2.1 Gaussian linear regression 

An extremely popular parametric model is Gaussian linear regression. In this case
we observe $n$ independent variables $X_1,\ldots,X_n$ from the Gaussian linear regression
set-up

$$X_i = \sum_{j=1}^{p}\beta_j Z_i^j + \sigma\xi_i \quad\text{for } 1\le i\le n, \tag{3.3}$$

where the random variables $\xi_i$ are i.i.d. standard normal while the numbers $Z_i^j$,
$1\le i\le n$, denote the respective deterministic and observable values of some explanatory
variable $Z^j$. Here, "variable" is taken in its usual sense of an "economic variable" or
a "physical variable". Practically speaking, $X_i$ corresponds to an observation in the
$i$-th experiment and it is assumed that this value depends linearly on the values $Z_i^j$
of the variables $Z^j$, $1\le j\le p$, in this experiment but with some additional random
perturbation represented by the random variable $\sigma\xi_i$. We assume here that all $p$
parameters $\beta_j$ are unknown but that $\sigma$ is known (this is not usually the case but
will greatly simplify our analysis). This set-up results in a parametric model with $p$
unknown parameters, since the distribution in $\mathbb{R}^n$ of the vector $X$ with coordinates
$X_i$ is entirely defined by the parameters $\beta_j$. More precisely, the random variables
$X_1,\ldots,X_n$ are independent with respective normal distributions $\mathcal{N}(s_i,\sigma^2)$ with
$s_i = \sum_{j=1}^{p}\beta_j Z_i^j$. Equivalently $X$ is a Gaussian vector with mean vector
$s = (s_i)_{1\le i\le n}$ and covariance matrix $\sigma^2 I_n$ where $I_n$ denotes the identity matrix
in $\mathbb{R}^n$. If we denote by $Z^j$ the vector with coordinates $Z_i^j$ and assume that the
vectors $Z^j$, $1\le j\le p$, span a $p$-dimensional linear space $S_p$, which we shall do, it is
equivalent to estimate the parameters $\beta_j$ or the vector $s\in S_p$.

The estimation problem can then be summarized as follows: observing the Gaussian
vector $X$ with distribution $\mathcal{N}(s,\sigma^2 I_n)$ with a known value of $\sigma$, estimate the
parameter $s$ which is assumed to belong to $S_p$. This is a parametric problem similar
to those we considered in Section 2 and it can be solved via the maximum likelihood
method. The density of $X$ with respect to the Lebesgue measure on $\mathbb{R}^n$ and the
log-likelihood of $s$ are respectively given by

$$\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - s_i)^2\Big] \quad\text{and}\quad -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - s_i)^2,$$

so that the maximum likelihood estimator $\hat{s}_p$ over $S_p$ is merely the orthogonal
projection of $X$ onto $S_p$ with risk $\mathbb{E}_s[\|s - \hat{s}_p\|^2] = \sigma^2 p$. This estimator actually makes
sense even if $s\notin S_p$ since, whatever the true value of $s\in\mathbb{R}^n$,

$$\mathbb{E}_s\big[\|s - \hat{s}_p\|^2\big] = \sigma^2 p + \inf_{t\in S_p}\|s - t\|^2. \tag{3.4}$$

The risk is the sum of two terms, one which is proportional to the number $p$ of
parameters to be estimated and another one which measures the accuracy of the
model $S_p$ we use. This second term vanishes when the model is correct (contains $s$).

3.2.2 Model choice again 

In the classical regression problem, the model $S_p$ is assumed to be correct so that
$\mathbb{E}_s[\|s - \hat{s}_p\|^2] = \sigma^2 p$ but this approach leads to two opposite problems. In order to
keep the term $\sigma^2 p$ in (3.4) small, we may be tempted to put too few explanatory
variables in the model, omitting some important ones so that not only $s\notin S_p$ but
$\inf_{t\in S_p}\|s - t\|^2$ may be very large, possibly larger than $\sigma^2 n$. In this case, it would be
wiser to use the largest possible model $\mathbb{R}^n$ for $s$ and the corresponding m.l.e. $\hat{s} = X$
resulting in the better risk $\mathbb{E}_s[\|s - \hat{s}\|^2] = \sigma^2 n$. In order to avoid this difficulty, we
may alternatively introduce many explanatory variables $Z^j$ in the model $S_p$. Then
even if it is correct, we shall get a large risk bound $\sigma^2 p$. It may then happen that
only a small number $q$ of the $p$ explanatory variables determining the model are
really influential. This means that if $S_q$ is the linear span of those $q$ variables, say
$Z^1,\ldots,Z^q$, $\inf_{t\in S_q}\|s - t\|^2$ is small. As a consequence, the risk bound of the m.l.e.
$\hat{s}_q$ with respect to $S_q$, i.e. $\sigma^2 q + \inf_{t\in S_q}\|s - t\|^2$, may be much smaller than $\sigma^2 p$.

These examples show that, even in the parametric case, the use of an approximate
model may be preferable to the use of a correct model, although a grossly wrong
model may lead to terrible results. The choice of a suitable model is therefore crucial:
a large model including many explanatory variables automatically results in a large
risk bound due to the component $\sigma^2 p$ of the risk in (3.4) while the choice of a too
parsimonious model including only a limited number of variables may result in a poor
estimator based on a grossly wrong model if we have omitted some very influential
variables.

A natural idea to solve this dilemma would be to start with some large family
$\{S_m, m\in\mathcal{M}\}$ of linear models indexed by some set $\mathcal{M}$ and with respective dimensions
$D_m$. For each of them, the corresponding m.l.e. $\hat{s}_m$ (the projection of $X$ onto $S_m$)
satisfies

$$\mathbb{E}_s\big[\|s - \hat{s}_m\|^2\big] = \sigma^2 D_m + \inf_{t\in S_m}\|s - t\|^2,$$

and an optimal model $S_{\overline{m}}$ is one that minimizes this quantity. But, as in the case of
histograms, this optimal model depends on the unknown parameter $s$ via $\inf_{t\in S_m}\|s - t\|$
so that $\hat{s}_{\overline{m}}$ is an "oracle", not a genuine estimator. Since this oracle is not available
to the statistician, he has to try an alternative method and use the observation $X$
to build a selection procedure $\hat{m}(X)$ of one model $S_{\hat{m}}$, estimating $s$ by $\tilde{s} = \hat{s}_{\hat{m}}$. An
ideal model selection procedure should have the performance of an oracle, i.e. satisfy

$$\mathbb{E}_s\big[\|s - \tilde{s}\|^2\big] = \inf_{m\in\mathcal{M}}\Big\{\sigma^2 D_m + \inf_{t\in S_m}\|s - t\|^2\Big\}, \tag{3.5}$$

but such a procedure cannot exist and the best that one can expect is to find selection
procedures satisfying a risk bound which is close to (3.5).
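
To make the notion of oracle concrete, here is a minimal sketch that evaluates the risk $\sigma^2 D_m + \inf_{t\in S_m}\|s-t\|^2$ over a nested family and returns the minimizing dimension. It assumes, purely for illustration, that $S_q$ is spanned by the first $q$ coordinate vectors of $\mathbb{R}^n$, so that the approximation term is a tail sum of squares; the function names are hypothetical.

```python
def model_risks(s, sigma):
    """Risk sigma^2 * q + ||s - projection of s onto S_q||^2 for q = 0, ..., n.

    Assumption: S_q is the span of the first q coordinate vectors, so the
    squared distance from s to S_q is the sum of s_i^2 over the remaining coordinates.
    """
    n = len(s)
    return [sigma ** 2 * q + sum(v * v for v in s[q:]) for q in range(n + 1)]

def oracle_dimension(s, sigma):
    """Dimension q minimizing the risk; it requires knowing the true s."""
    risks = model_risks(s, sigma)
    return min(range(len(risks)), key=risks.__getitem__)
```

For instance, with `s = [3.0, 2.0, 0.1, 0.05]` and `sigma = 1.0` the two large coordinates are worth estimating and the two small ones are not, so the oracle keeps two variables. The computation uses $s$ itself, which is exactly why the oracle is not an estimator.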

4 Model based statistical estimation 

In three different contexts, namely histogram estimation for densities, projection
estimation for densities and Gaussian linear regression, we have seen that the use
of an approximate model associated with a convenient estimator with values in the
model leads to three risk bounds, namely (1.12), (3.2) and (3.4), which share the
same structure. These bounds are the sum of two terms: one is the squared distance
of the unknown parameter to the model, the other is proportional to the number of
parameters that are involved in the model. One can therefore wonder to what extent
this situation is typical.

4.1 A general statistical framework 

Before we proceed to the solution of the problem, let us make the statistical framework
on which we work somewhat more precise. We observe a random phenomenon $X(\omega)$
(real variable, vector, sequence, process, set, ...) from the abstract probability space
$(\Omega,\mathcal{A},\mathbb{P})$ with values in the measurable set $(\Xi,\mathcal{X})$ and with unknown probability
distribution $P_X$ on $(\Xi,\mathcal{X})$ given by

$$P_X[A] = \mathbb{P}\big[X^{-1}(A)\big] = \mathbb{P}\big[\{\omega\in\Omega \mid X(\omega)\in A\}\big] \quad\text{for all } A\in\mathcal{X}.$$

The purpose of statistical estimation is to get some information on this distribution
from one observation $X(\omega)$ of the phenomenon. We assume that $P_X$ belongs to
some given subset $\mathcal{P} = \{P_t, t\in M\}$ of the set of all distributions on $(\Xi,\mathcal{X})$, where $M$
denotes a one-to-one parametrization of $\mathcal{P}$. We moreover assume that $M$ is a metric
space with a distance $d$. Therefore $P_X = P_s$ for some $s\in M$ and we want to estimate
$P_s$, or equivalently $s$, in view of this one-to-one correspondence which also allows us
to consider $d$ as a distance on $\mathcal{P}$ as well. As in Section 1.2.2 we look for an estimator
of $s$, i.e. a measurable mapping $\hat{s}$ from $(\Xi,\mathcal{X})$ to $M$ (with its Borel $\sigma$-algebra) such
that $\hat{s}(X)$ provides a good approximation of the unknown value $s$. We measure the
performance of the estimator $\hat{s}(X)$ via its quadratic risk

$$R(\hat{s}, s) = \mathbb{E}_s\big[d^2(\hat{s}(X), s)\big]. \tag{4.1}$$

There is a very large number of possibilities for the choice of $\mathcal{P}$ depending on the
structure of $(\Xi,\mathcal{X})$ and the problem we have to solve. In this paper we focus on the
two particular but typical examples that we considered earlier, namely the density
estimation problem and the Gaussian regression problem which amounts to the
estimation of the mean of a Gaussian vector. In both cases $\Xi = E^n$ is a product space
with a product $\sigma$-algebra $\mathcal{X} = \mathcal{E}^{\otimes n}$ so that $X$ is the vector $(X_1,\ldots,X_n)$ and the $X_i$
are random variables with values in $(E,\mathcal{E})$.

Density estimation For the density estimation problem we are given some reference
measure $\nu$ on $(E,\mathcal{E})$ and we assume that the $X_i$ are i.i.d. random variables with
a density $s$ with respect to $\nu$, in which case $M$ can be chosen as the set of all densities
with respect to $\nu$, i.e. the subset of $\mathbb{L}_1(\nu)$ of nonnegative functions which integrate
to one. Such a situation occurs when one replicates the same experiment $n$ times
under identical conditions and assumes that each experiment has no influence on the
others, for instance when we observe the successive outcomes of a "roulette" game.
Then, for each $t\in M$, $P_t$ has the density $\prod_{i=1}^n t(x_i)$ with respect to $\mu = \nu^{\otimes n}$.

Gaussian regression This is the case that we considered in Section 3.2 with $(E,\mathcal{E})$
being the real line with its Borel $\sigma$-algebra. Here $X$ is a Gaussian vector in $\mathbb{R}^n$ with
known covariance matrix $\sigma^2 I_n$. Then $M = \mathbb{R}^n$ and $t = (t_1,\ldots,t_n)\in M$ is the
unknown mean vector of the Gaussian distribution $P_t = \mathcal{N}(t,\sigma^2 I_n)$ with density

$$g_t(x) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - t_i)^2\Big] \tag{4.2}$$

with respect to the Lebesgue measure $\mu$ on $\mathbb{R}^n$.

4.2 Two point parameter sets 

Before we come to the general situation, it will be useful to analyze a special, quite
unrealistic, but very simple case. Let us make the extra assumption that $s$ belongs to
the smallest possible parameter set, i.e. a subset $S$ of $M$ containing only two elements
$v$ and $u$. Note that the statistical problem would be void if $S$ contained only one point
since $s$ would then be known.

A solution to this estimation problem is provided by the maximum likelihood
method described in Section 2.2. Let $\mu$ be any measure dominating both $P_v$ and
$P_u$ ($P_v + P_u$ would do) and denote by $g_v$ and $g_u$ the respective densities of $P_v$ and $P_u$
with respect to $\mu$. Then define an estimator $\hat{\varphi}(X)$ with values in $S$ by

$$\hat{\varphi}(X) = \begin{cases} v & \text{if } g_v(X) > g_u(X), \\ u & \text{if } g_u(X) > g_v(X). \end{cases} \tag{4.3}$$

Take any decision you like in case of equality. If $s = v$, we get

$$R(\hat{\varphi}, s) = d^2(v,u)\,\mathbb{P}_v[\hat{\varphi} = u] \le d^2(v,u)\,\mathbb{P}_v[g_u(X) \ge g_v(X)].$$

Since the distribution of $X$ is $P_s = P_v = g_v\cdot\mu$, $\mathbb{P}_v[g_v(X) > 0] = 1$ and

$$\mathbb{P}_v[g_u(X) \ge g_v(X)] = \mathbb{P}_v\Big[\sqrt{g_u(X)/g_v(X)} \ge 1\Big] \le \mathbb{E}_v\Big[\sqrt{g_u(X)/g_v(X)}\Big] = \int \sqrt{g_u(x)/g_v(x)}\;g_v(x)\,d\mu(x) = \int \sqrt{g_u(x)g_v(x)}\,d\mu(x).$$

Hence $R(\hat{\varphi}, s) \le d^2(v,u)\,\rho(P_v,P_u)$ with

$$\rho(P_v,P_u) = \rho(P_u,P_v) = \int \sqrt{g_u(x)g_v(x)}\,d\mu(x). \tag{4.4}$$

It is easily seen that the definition of $\rho(P_v,P_u)$ via (4.4) is independent of the choice
of the dominating measure $\mu$. Since the same risk bound holds when $s = u$, we finally
get

$$\sup_{s\in\{u,v\}} R(\hat{\varphi}, s) \le d^2(v,u)\,\rho(P_v,P_u). \tag{4.5}$$

This bound demonstrates the importance of the so-called Hellinger affinity $\rho(P,Q)$
between two probability measures $P$ and $Q$. By the Cauchy-Schwarz inequality and
Fubini's theorem, it satisfies in particular

$$0 \le \rho(P,Q) \le 1 \quad\text{and}\quad \rho\big(P^{\otimes n}, Q^{\otimes n}\big) = \rho^n(P,Q). \tag{4.6}$$

It is, moreover, closely related to a well-known distance between probabilities, the
Hellinger distance $h$ defined by

$$h^2(P,Q) = \frac{1}{2}\int\Big(\sqrt{\frac{dP}{d\mu}(x)} - \sqrt{\frac{dQ}{d\mu}(x)}\Big)^2 d\mu(x) = 1 - \rho(P,Q). \tag{4.7}$$

Up to the normalizing factor, the Hellinger distance is merely the $\mathbb{L}_2(\mu)$-distance between
the square roots of the densities with respect to any dominating measure $\mu$ (and actually
independent of $\mu$). Here, we follow Le Cam who normalizes the integral so that the Hellinger
distance has range $[0,1]$. An alternative definition is without the factor 1/2 in (4.7). He also
showed in Le Cam (1973) that

$$\rho(P,Q) \ge \int \min\Big\{\frac{dP}{d\mu}(x); \frac{dQ}{d\mu}(x)\Big\}\,d\mu(x) \ge 1 - \sqrt{1 - \rho^2(P,Q)}. \tag{4.8}$$
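
The relations (4.6), (4.7) and (4.8) are easy to check numerically for distributions on a finite set, where the integrals become sums over the counting measure. The sketch below is only such a check; the function names are hypothetical.

```python
import math

def affinity(p, q):
    """Hellinger affinity rho(P, Q) = sum over x of sqrt(p(x) q(x)) for discrete P, Q."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance with Le Cam's normalization (range [0, 1])."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def overlap(p, q):
    """Sum of min(p(x), q(x)), the middle term of (4.8)."""
    return sum(min(a, b) for a, b in zip(p, q))

def product(p, n):
    """Probabilities of the n-fold product measure, listed over all n-tuples."""
    probs = [1.0]
    for _ in range(n):
        probs = [a * b for a in probs for b in p]
    return probs
```

With `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.3, 0.5]`, one can compare `hellinger_sq(p, q)` with `1 - affinity(p, q)` and `affinity(product(p, 3), product(q, 3))` with `affinity(p, q) ** 3`.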



It is easy to compute $\rho(P_v,P_u)$ for our two special frameworks. In the case of
Gaussian distributions $P_u = \mathcal{N}(u,\sigma^2 I_n)$ and $P_v = \mathcal{N}(v,\sigma^2 I_n)$, we get

$$\rho(P_v,P_u) = \exp\big[-\|v-u\|^2/(8\sigma^2)\big],$$

so that $[-\log\rho]^{1/2}$ is a multiple of the Euclidean distance between parameters, modulo
the identification of $t$ and $P_t$. Note that, in general, $[-\log\rho]^{1/2}$ is not a distance since
it may be infinite and does not satisfy the triangle inequality. Setting $d(v,u) = \|v-u\|$,
(4.5) becomes

$$\sup_{s\in\{u,v\}} R(\hat{\varphi}, s) \le \|v-u\|^2\exp\big[-\|v-u\|^2/(8\sigma^2)\big] \le 8e^{-1}\sigma^2,$$

independently of $v$ and $u$. In the i.i.d. case, we use the Hellinger distance to define the
risk, setting $d(u,v) = h(u,v) = h(P_u,P_v)$ and (4.5) becomes, whatever the densities
$v$ and $u$,

$$\sup_{s\in\{u,v\}} R(\hat{\varphi}, s) \le h^2(v,u)\big[1 - h^2(v,u)\big]^n \le n^n(n+1)^{-(n+1)} \le (ne)^{-1}.$$
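
Both final inequalities come from maximizing a one-variable function: $x\mapsto x\,e^{-x/(8\sigma^2)}$ is maximal at $x = 8\sigma^2$, and $z\mapsto z(1-z)^n$ is maximal at $z = 1/(n+1)$. The following sketch, with hypothetical function names, evaluates the two right-hand sides so that these maxima can be checked numerically.

```python
import math

def gaussian_two_point_bound(dist, sigma):
    """Right-hand side of (4.5) for Gaussian vectors: dist^2 * exp(-dist^2 / (8 sigma^2))."""
    return dist ** 2 * math.exp(-dist ** 2 / (8.0 * sigma ** 2))

def iid_two_point_bound(h2, n):
    """Right-hand side of (4.5) for n i.i.d. observations: h^2 * (1 - h^2)^n."""
    return h2 * (1.0 - h2) ** n
```

Evaluating `gaussian_two_point_bound` at `dist = sqrt(8) * sigma` gives the value $8e^{-1}\sigma^2$, and `iid_two_point_bound(1 / (n + 1), n)` gives $n^n(n+1)^{-(n+1)}$.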

4.3 Two point models for the Gaussian framework 

As we pointed out at the beginning of the last section, assuming that $s$ is either $v$
or $u$ is definitely unrealistic. A more realistic problem would rather be as follows: $s$
is unknown but we believe that one of two different situations can occur, implying
that $s$ is close (not necessarily equal) to either $v$ or $u$. Then it seems natural to
use $S = \{v,u\}$ as an approximate model for $s$ and just proceed as before, using the
estimator $\hat{\varphi}(X)$ defined by (4.3). We can then try to mimic the proof which led to
(4.5), apart from the fact that the argument leading to

$$\mathbb{P}_s[g_u(X) \ge g_v(X)] \le \rho(P_v,P_u) = \exp\big[-\|v-u\|^2/(8\sigma^2)\big]$$

then fails. One can instead prove the following result (Birgé, 2006).

Proposition 1 Let $P_t$ denote the Gaussian distribution $\mathcal{N}(t,\sigma^2 I_n)$ in $\mathbb{R}^n$. If $X$ is
a Gaussian vector with distribution $P_s$ and $\|s-v\| \le \|v-u\|/6$, then

$$\mathbb{P}_s[g_u(X) \ge g_v(X)] \le \exp\big[-\|v-u\|^2/(24\sigma^2)\big].$$

We can then proceed as before and conclude that, if $\|s-v\| \le \|v-u\|/6$, then

$$R(\hat{\varphi},s) \le 2\big(\|s-v\|^2 + \mathbb{E}_s[\|\hat{\varphi}-v\|^2]\big)
\le 2\big(\|s-v\|^2 + \|v-u\|^2\exp\big[-\|v-u\|^2/(24\sigma^2)\big]\big)
\le 2\|s-v\|^2 + 48e^{-1}\sigma^2.$$

A similar bound holds with $u$ replacing $v$ if $\|s-u\| \le \|v-u\|/6$. Finally, if
$\min\{\|s-v\|,\|s-u\|\} > \|v-u\|/6$, since $\hat{\varphi}$ is either $v$ or $u$,

$$R(\hat{\varphi},s) \le \big(\max\{\|s-v\|,\|s-u\|\}\big)^2
\le \big(\min\{\|s-v\|,\|s-u\|\} + \|v-u\|\big)^2
< 49\big(\min\{\|s-v\|,\|s-u\|\}\big)^2.$$

We finally conclude that, whatever $s\in M$, even if our initial assumption that $s$ is
close to $S$ is wrong,

$$R(\hat{\varphi},s) \le 48e^{-1}\sigma^2 + 49\inf_{t\in S}\|s-t\|^2,$$

which, apart from the constants, is similar to (3.4).
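
In the Gaussian case the probability in Proposition 1 can be computed exactly, since $g_u(X)\ge g_v(X)$ means that $X$ is at least as close to $u$ as to $v$, and the scalar product of $X$ with $u-v$ is a one-dimensional Gaussian variable. The sketch below (hypothetical function names, not the proof used in Birgé, 2006) computes this probability and the bound of Proposition 1 so that the two can be compared.

```python
import math

def error_probability(s, v, u, sigma):
    """Exact P_s[g_u(X) >= g_v(X)] when X ~ N(s, sigma^2 I_n).

    The event is <X - (u + v)/2, u - v> >= 0, and <X, u - v> is Gaussian
    with mean <s, u - v> and standard deviation sigma * ||u - v||.
    """
    diff = [a - b for a, b in zip(u, v)]
    norm = math.sqrt(sum(d * d for d in diff))
    margin = sum(((a + b) / 2.0 - c) * d for a, b, c, d in zip(u, v, s, diff))
    z = margin / (sigma * norm)
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of N(0, 1) at z

def proposition_1_bound(v, u, sigma):
    """Bound exp(-||v - u||^2 / (24 sigma^2)) of Proposition 1."""
    dist2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-dist2 / (24.0 * sigma ** 2))
```

When $s$ lies on the segment from $v$ to $u$ at distance $\|v-u\|/6$ from $v$, the exact probability is the standard normal upper tail at $\|v-u\|/(3\sigma)$, which is indeed below the bound of Proposition 1.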

4.4 General models for the Gaussian framework 
4.4.1 Linear models 

Instead of assuming that $s$ is close to a two-point set, let us now assume that it
is close to some $D$-dimensional linear subspace $V$ of $\mathbb{R}^n$ ($D > 0$). Choose some
$\lambda \ge 4\sqrt{3}\,\sigma$ and, identifying $V$ to $\mathbb{R}^D$ via some orthonormal basis, consider the lattice
$S = (2\lambda\mathbb{Z})^D \subset V$. The maximum likelihood estimator $\hat{s}(X)$ with respect to $S$ is given
by $\hat{s}(X) = \mathrm{argmax}_{t\in S}\, g_t(X)$. Its uniqueness follows from the facts that $S$ is countable
and $\mathbb{P}_s[g_t(X) = g_u(X)] = 0$ for each pair $(t,u)\in S^2$ such that $t\ne u$. As to its
existence (with probability one), it is a consequence of the following result.

Proposition 2 Let $s$ be an arbitrary point in $M = \mathbb{R}^n$ and $s'\in S$. If

$$y \ge y_0 = \max\big\{\lambda\sqrt{2D},\ 6\|s'-s\|\big\}, \tag{4.9}$$

then

$$\mathbb{P}_s\big[\exists\, t\in S \text{ with } \|s'-t\| \ge y \text{ and } g_t(X) \ge g_{s'}(X)\big] \le 1.14\exp\Big[-\frac{y^2}{48\sigma^2}\Big]. \tag{4.10}$$

Proof: Let $S_k = \{t\in S \mid 2^{k/2}y \le \|s'-t\| < 2^{(k+1)/2}y\}$ with cardinality $|S_k|$. If we
denote by $P(y)$ the left-hand side of (4.10), we get

$$P(y) \le \sum_{k=0}^{+\infty}\mathbb{P}_s\big[\exists\, t\in S_k \text{ with } g_t(X) \ge g_{s'}(X)\big] \le \sum_{k=0}^{+\infty}|S_k|\sup_{t\in S_k}\mathbb{P}_s[g_t(X) \ge g_{s'}(X)]. \tag{4.11}$$

Since, for $t\in S_k$, $\|s'-t\| \ge 2^{k/2}y \ge 6\|s'-s\|$, we may apply Proposition 1 to get

$$\sup_{t\in S_k}\mathbb{P}_s[g_t(X) \ge g_{s'}(X)] \le \exp\big[-2^k y^2/(24\sigma^2)\big]. \tag{4.12}$$

Moreover, for any ball $B(s',r)$ with center $s'$ and radius $r = x\lambda\sqrt{D}$ with $x \ge 2$,

$$|S\cap B(s',r)| \le \exp\big[x^2 D/2\big]. \tag{4.13}$$

To prove this, we apply the next inequality which follows from a comparison of the
volumes of cubes and balls in $\mathbb{R}^D$ as in the proof of Lemma 2 from Birgé and Massart
(1998):

$$|S\cap B(s',r)| \le \frac{(\pi e/2)^{D/2}}{\sqrt{\pi D}}\Big(\frac{r}{\lambda\sqrt{D}} + 1\Big)^D \le \exp\big[D(0.73 + \log(x+1))\big].$$

We then get (4.13) since $x \ge 2$. Applying it with $r = 2^{(k+1)/2}y \ge 2^{1+k/2}\lambda\sqrt{D}$ by
(4.9) leads to $|S_k| \le \exp[2^k(y/\lambda)^2]$. Together with (4.11) and (4.12), this shows that

$$P(y) \le \sum_{k=0}^{+\infty}\exp\Big[\frac{2^k y^2}{\lambda^2} - \frac{2^k y^2}{24\sigma^2}\Big]
\le \sum_{k=0}^{+\infty}\exp\Big[-\frac{2^k y^2}{48\sigma^2}\Big]
= \exp\Big[-\frac{y^2}{48\sigma^2}\Big]\sum_{k=0}^{+\infty}\exp\Big[-\frac{y^2}{48\sigma^2}\big(2^k - 1\big)\Big].$$

The conclusion follows from the fact that $y^2 \ge y_0^2 \ge 2\lambda^2 D \ge 2\lambda^2 \ge 96\sigma^2$. □
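
The two numerical constants used in the proof can be verified directly. Since $y^2/(48\sigma^2) \ge 2$, the last series is at most $\sum_{k\ge 0}\exp[-2(2^k-1)]$, and the passage from the volume bound to (4.13) requires $0.73 + \log(x+1) \le x^2/2$ for $x \ge 2$. The sketch below, with hypothetical function names, evaluates both.

```python
import math

def series_constant(c, terms=60):
    """Sum over k >= 0 of exp(-c * (2^k - 1)), truncated after `terms` terms."""
    return sum(math.exp(-c * (2 ** k - 1)) for k in range(terms))

def cardinality_exponent_ok(x):
    """Check 0.73 + log(x + 1) <= x^2 / 2, the step from the volume bound to (4.13)."""
    return 0.73 + math.log(x + 1.0) <= x * x / 2.0
```

The series decreases with `c`, so its value at `c = 2` is the worst case allowed by the condition on $y$, and it stays below the constant 1.14 of (4.10).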

Proposition 2 implies that, for $y \ge y_0$, there exists a set $\Omega_y\subset\Omega$ with
$\mathbb{P}_s(\Omega_y) \ge 1 - 1.14\exp[-y^2/(48\sigma^2)]$ and such that, for $\omega\in\Omega_y$, the function
$t\mapsto g_t(X(\omega))$ has a maximum in the ball $B(s',y)$. This shows that, if $\omega\in\Omega_y$, the
m.l.e. $\hat{s}(X(\omega))$ exists and satisfies $\|\hat{s}(X(\omega)) - s'\| \le y$. As a consequence, the m.l.e.
$\hat{s}(X)$ exists a.s. and

$$\mathbb{E}_s\big[\|\hat{s}(X) - s'\|^2\big] = \int_0^{+\infty}\mathbb{P}_s\big[\|\hat{s}(X) - s'\|^2 > z\big]\,dz
\le y_0^2 + \int_{y_0^2}^{+\infty}\mathbb{P}_s\big[\|\hat{s}(X) - s'\| > \sqrt{z}\big]\,dz$$
$$\le y_0^2 + 1.14\int_{y_0^2}^{+\infty}\exp\Big[-\frac{z}{48\sigma^2}\Big]\,dz
= y_0^2 + 1.14\times 48\sigma^2\exp\big[-y_0^2/(48\sigma^2)\big]
\le y_0^2 + 55e^{-2}\sigma^2.$$

Then

$$\mathbb{E}_s\big[\|\hat{s}(X) - s\|^2\big] \le 2\Big[\|s-s'\|^2 + \mathbb{E}_s\big[\|\hat{s}(X) - s'\|^2\big]\Big]
\le 2\big[\|s-s'\|^2 + y_0^2 + 55e^{-2}\sigma^2\big]
\le 2\big[37\|s-s'\|^2 + 2\lambda^2 D + 55e^{-2}\sigma^2\big].$$

Note that the construction of $S$ as a lattice in $V$ implies that any point in $V$ is at a
distance of some point in $S$ not larger than $\lambda\sqrt{D}$, which means that one can choose
$s'$ in such a way that $\|s-s'\| \le \inf_{t\in V}\|s-t\| + \lambda\sqrt{D}$. With such a choice for $s'$, we
get

$$\mathbb{E}_s\big[\|\hat{s}(X) - s\|^2\big] \le 2\Big[74\inf_{t\in V}\|s-t\|^2 + 76\lambda^2 D + 55e^{-2}\sigma^2\Big].$$

Setting $\lambda$ to its minimum value $4\sqrt{3}\,\sigma$, we conclude, since $D \ge 1$, that

$$\mathbb{E}_s\big[\|\hat{s}(X) - s\|^2\big] \le 148\inf_{t\in V}\|s-t\|^2 + 7311\sigma^2 D. \tag{4.14}$$
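
For a concrete picture of this estimator, the sketch below computes the m.l.e. over the lattice in the simplest situation, assuming (for illustration only) that $V$ is spanned by the first $D$ coordinate vectors of $\mathbb{R}^n$. Maximizing $g_t(X)$ over $S$ is the same as minimizing $\|X-t\|$, which reduces to rounding each of the first $D$ coordinates to the nearest multiple of $2\lambda$. The function names are hypothetical.

```python
import math

def lattice_mle(x, dim, sigma):
    """M.l.e. over the lattice (2 * lam * Z)^dim with lam = 4 * sqrt(3) * sigma.

    Assumption: V is spanned by the first `dim` coordinate vectors, so the
    projection of x onto V keeps its first `dim` coordinates. The nearest
    lattice point is then obtained by coordinatewise rounding.
    """
    step = 2.0 * 4.0 * math.sqrt(3.0) * sigma
    head = [step * round(c / step) for c in x[:dim]]
    return head + [0.0] * (len(x) - dim)

def risk_bound_4_14(bias_sq, dim, sigma):
    """Right-hand side of (4.14): 148 * inf ||s - t||^2 + 7311 * sigma^2 * D."""
    return 148.0 * bias_sq + 7311.0 * sigma ** 2 * dim
```

The constant 7311 comes from $2\times 76\times 48 + 110e^{-2} \approx 7310.9$. The lattice mesh $2\lambda = 8\sqrt{3}\,\sigma$ is coarse compared with the noise level, which is one reason why the constants in (4.14) are so much larger than in (3.4).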



4.4.2 General models with finite metric dimension 

Note that, apart from the huge constants that we actually did not try to optimize
in order to keep the computations as simple as possible, (4.14) is quite similar to
(3.4), although we actually used a different estimation procedure, and also a different
method of proof which has an important advantage: it did not make any use of the
fact that $V$ is a linear space. What we actually used are the metric properties of the
$D$-dimensional linear subspace $V$ of $M = \mathbb{R}^n$, which can be summarized as follows.

Property P Whatever $\eta > 0$, one can find a subset $S$ of $M$ such that:

i) for each $t\in V$ there exists some $t'\in S$ with $\|t-t'\| \le \eta$;

ii) for any ball $B(t,x\eta)$ with center $t\in M$ and radius $x\eta$,

$$|S\cap B(t,x\eta)| \le \exp\big[x^2 D/2\big] \quad\text{for } x \ge 2.$$

In the previous example we simply defined $S$ so that $\eta = \lambda\sqrt{D} = 4\sigma\sqrt{3D}$.

The fact that the previous property of $V$ was a key argument in the proof motivates
the following general definition.

Definition 1 Let $S$ be a subset of some metric space $(M,d)$ and $\overline{D}$ be some real
number $\ge 1/2$. We say that $S$ has a finite metric dimension bounded by $\overline{D}$ if, for
every $\eta > 0$, one can find a subset $S_\eta$ of $M$ such that:

i) for each $t\in S$ there exists some $t'\in S_\eta$ with $d(t,t') \le \eta$ (we say that $S_\eta$ is an
$\eta$-net for $S$);

ii) for any ball $B(t,x\eta)$ with center $t\in M$ and radius $x\eta$,

$$|S_\eta\cap B(t,x\eta)| \le \exp\big[x^2\overline{D}\big] \quad\text{for } x \ge 2.$$

Note that any subset of $S$ also has a finite metric dimension bounded by $\overline{D}$. It follows
from Property P that a $D$-dimensional linear subspace of a Euclidean space has a
metric dimension bounded by $D/2$. Note that, apart from the factor 1/2, this result
cannot be improved in view of the following lower bound for the metric dimension of
a $D$-dimensional ball.

Lemma 1 Let $S$ be a ball of the metric space $(M,d)$ which is isometric to a ball in
the Euclidean space $\mathbb{R}^D$. Then a bound $\overline{D}$ for its metric dimension cannot be smaller
than $D/13$.

Proof: Let $S = B(t,r)$ have a finite metric dimension bounded by $\overline{D}$ and $\eta < r/3$. One
can find $S_\eta$ in $M$ which is an $\eta$-net for $S$ and such that $N = |S_\eta\cap B(t,3\eta)| \le \exp[9\overline{D}]$.
Moreover, $S_\eta$ is also an $\eta$-net for $B(t,2\eta)$ so that $B(t,2\eta)$ can be covered by the $N$ balls
with radius $\eta$ and centers in $S_\eta\cap B(t,3\eta)$. Since $B(t,3\eta)\subset S$ we can use the isometry
to show, comparing the volumes of the balls, that $N \ge 2^D$ so that $9\overline{D} \ge D\log 2$ and
the conclusion follows. □

Introducing Definition 1 in the proof of Proposition 2 we get the following result.

Theorem 1 Let $X$ be a Gaussian vector in $\mathbb{R}^n$ with unknown mean $s$ and known
covariance matrix $\sigma^2 I_n$. Let $S$ be a subset of the Euclidean space $\mathbb{R}^n$ with a finite
metric dimension bounded by $\overline{D}$. Then one can build an estimator $\hat{s}_S(X)$ of $s$ such
that, for some universal constant $C$ (independent of $s$, $n$ and $S$),

$$\mathbb{E}_s\big[\|\hat{s}_S(X) - s\|^2\big] \le C\Big[\inf_{t\in S}\|s-t\|^2 + \sigma^2\overline{D}\Big]. \tag{4.15}$$

This theorem implies that we can use for models non-linear sets that have a finite
metric dimension. In particular, various types of manifolds could be used as models.
To build the estimator $\hat{s}_S(X)$, we set $\eta = 4\sigma\sqrt{6\overline{D}}$ and choose an $\eta$-net $S_\eta$ for $S$
satisfying the properties of Definition 1. Then we take for $\hat{s}_S(X)$ the m.l.e. with
respect to $S_\eta$.

4.5 Density estimation 

When we want to extend the results obtained for the Gaussian framework to density
estimation we encounter new difficulties. The two key arguments used in the proof of
Proposition 2 are that $V$ has a finite metric dimension and Proposition 1. For i.i.d.
observations $X_1,\ldots,X_n$ with density $s$ and in view of the fact that

$$\mathbb{P}_s\Big[\prod_{i=1}^n u(X_i) \ge \prod_{i=1}^n v(X_i)\Big] \le \exp\big[-nh^2(u,v)\big] \quad\text{if } s = v,$$

an analogous result would be as follows:

Conjecture C Let $X_1,\ldots,X_n$ be i.i.d. random variables with an unknown density $s$
with respect to some measure $\nu$ on $(E,\mathcal{E})$. There exist two constants $\kappa \ge 2$ and $A > 0$
such that, whatever the densities $u,v$ on $(E,\mathcal{E})$ such that $h(s,v) \le \kappa^{-1}h(u,v)$, then

$$\mathbb{P}_s\Big[\prod_{i=1}^n u(X_i) \ge \prod_{i=1}^n v(X_i)\Big] \le \exp\big[-Anh^2(u,v)\big].$$

If this conjecture were true one could mimic the proof for the Gaussian case, starting
from a subset $S$ of the metric space $(M,h)$ with finite metric dimension, choosing a
suitable $\eta$-net $S_\eta$ for $S$ and computing the m.l.e. with respect to $S_\eta$ to get an analogue
of Theorem 1. Unfortunately Conjecture C is wrong and, as a consequence, one can
find situations in the i.i.d. framework where the m.l.e. with respect to $S_\eta$ does not
behave at all as expected. To get an analogue of Theorem 1 for density estimation,
one cannot work with the maximum likelihood method any more. An alternative
method that makes it possible to deal with the problem of density estimation was proposed
by Le Cam (1973 and 1975), who also introduced a notion of metric dimension, and
then extended by the present author in Birgé (1983 and 1984). In the sequel, we shall
follow the generalized approach of Birgé (2006) from which we borrow this substitute
for Conjecture C:

Proposition 3 Let $X_1,\ldots,X_n$ be i.i.d. random variables with an unknown density
$s$ with respect to some measure $\nu$ on $(E,\mathcal{E})$. Whatever the densities $u,v$, one can
design a procedure $\varphi_{u,v}(X_1,\ldots,X_n)$ with values in $\{u,v\}$ and such that

$$\mathbb{P}_s[\varphi_{u,v}(X_1,\ldots,X_n) = u] \le \exp\big[-(n/4)h^2(u,v)\big] \quad\text{if } h(s,v) \le h(u,v)/4;$$

$$\mathbb{P}_s[\varphi_{u,v}(X_1,\ldots,X_n) = v] \le \exp\big[-(n/4)h^2(u,v)\big] \quad\text{if } h(s,u) \le h(u,v)/4.$$

The main difference with Conjecture C lies in the fact that the procedure $\varphi_{u,v}$ does not
choose between $u$ and $v$ by merely comparing $\prod_{i=1}^n u(X_i)$ and $\prod_{i=1}^n v(X_i)$. It is more
complicated. This implies that, in this case, we have to design a new estimator $\hat{s}_S(X)$,
based on Proposition 3, to replace the m.l.e. The construction of this estimator is
more complicated than that of the m.l.e. and we shall not describe it here. The
following analogue of Theorem 1 is proved in Birgé (2006).

Theorem 2 Let $X = (X_1,\ldots,X_n)$ be an i.i.d. sample with unknown density $s$ with
respect to some measure $\nu$ on $(E,\mathcal{E})$ and $(M,h)$ be the metric space of all such
densities with Hellinger distance. Let $S$ be a subset of $(M,h)$ with a finite metric dimension
bounded by $\overline{D}$. Then one can build an estimator $\hat{s}_S(X_1,\ldots,X_n)$ of $s$ such that, for
some universal constant $C$,

$$\mathbb{E}_s\big[h^2(\hat{s}_S, s)\big] \le C\Big[\inf_{t\in S}h^2(s,t) + n^{-1}\overline{D}\Big]. \tag{4.16}$$



Analogues of Proposition 2 do hold for various statistical frameworks, although not
all. Additional examples are to be found in Birgé (2004 and 2006). For each such
case, one can, starting from a model $S$ with finite metric dimension bounded by $\overline{D}$,
design a suitable estimator $\hat{s}_S(X)$ and then get an analogue of Theorem 1. Within
the general framework of Section 4.1 the resulting risk bound takes the following
form:

$$\mathbb{E}_s\big[d^2(\hat{s}_S, s)\big] \le C_1\inf_{t\in S}d^2(s,t) + C_2\overline{D} \quad\text{for all } s\in M, \tag{4.17}$$

where the constants $C_1$ and $C_2$ depend on the corresponding statistical framework
(compare with (4.15) and (4.16)) but not on $s$ or $S$. The main task is indeed to
prove the proper alternative to Proposition 01 Once this has been done, (|4,17|) follows 
more or less straightforwardly. 

The extent to which maximum likelihood or related estimators can provide bounds of
the form (4.17) has been studied in various papers, among which van de Geer (1990,
1993, 1995 and 2000), Shen and Wong (1994) and Wong and Shen (1995), Birgé
and Massart (1993 and 1998), Györfi, Kohler, Krzyżak and Walk (2002) and Massart
(2006).

5 Model selection 

Let us consider a statistical framework for which an analogue of Proposition 3 holds
so that any model $S$ with finite metric dimension bounded by $\overline{D}$ provides an
estimator $\hat{s}_S(X)$ with a risk bounded by (4.17). Then the quality of a given model
$S$ for estimating $s$ can be measured by the right-hand side of (4.17). Since this
quality depends on the unknown $s$ via the approximation term $\inf_{t\in S}d^2(s,t)$, we
cannot know it. Introducing a large family $\{S_m, m\in\mathcal{M}\}$ of models, each one with
finite metric dimension bounded by $\overline{D}_m$, instead of one single model, gives more
chance to get an estimator $\hat{s}_m = \hat{s}_{S_m}$ in the family with the smaller risk bound
$\inf_{m\in\mathcal{M}}\{C_1\inf_{t\in S_m}d^2(s,t) + C_2\overline{D}_m\}$. Since we do not know which estimator reaches
this bound, the challenge of model selection is to design a random choice $\hat{m}(X)$ of
$m$ such that the corresponding estimator approximately reaches this optimal risk,
i.e. satisfies

$$\mathbb{E}_s\big[d^2(s,\hat{s}_{\hat{m}})\big] \le C\inf_{m\in\mathcal{M}}\Big\{C_1\inf_{t\in S_m}d^2(s,t) + C_2\overline{D}_m\Big\}, \tag{5.1}$$

for some constant $C$ independent of $s$ and the family of models.

5.1 Some natural limitations to the performances of model selection 

Let us show here, in the context of Gaussian regression, that getting a bound like
(5.1) for arbitrary families of models is definitely too optimistic. If, in this context,
(5.1) were true, we would be able to design a model selection procedure $\hat{m}$ satisfying,
in view of (4.14),

$$\mathbb{E}_s\big[\|s - \hat{s}_{\hat{m}}\|^2\big] \le C'\inf_{m\in\mathcal{M}}\Big\{\sigma^2\overline{D}_m + \inf_{t\in S_m}\|s-t\|^2\Big\}, \tag{5.2}$$

for some universal constant $C'$, independent of $s$, $n$ and the family of models. It is
not difficult to see that this is impossible, even if we restrict ourselves to countable
families of models. Indeed, if (5.2) were true, we could choose for $\{S_m, m\in\mathcal{M}\}$ a
countable family of one-dimensional linear spaces such that each point $s\in\mathbb{R}^n$ could
be approximated by one space in the family with arbitrary accuracy. We would then
get $\overline{D}_m = 1/2$ for each $m$ and (5.2) would imply that

$$\mathbb{E}_s\big[\|s - \hat{s}_{\hat{m}}\|^2\big] \le C'\sigma^2/2 \quad\text{for all } s\in\mathbb{R}^n.$$

But it is known that the best bound one can expect for any estimator $\hat{s}$ uniformly
with respect to $s\in\mathbb{R}^n$ is

$$\sup_{s\in\mathbb{R}^n}\mathbb{E}_s\big[\|s - \hat{s}\|^2\big] = n\sigma^2,$$

which contradicts the fact that $C'$ should be a universal constant. One actually has
to pay a price for using many models simultaneously and, as we shall see, this price
depends on the complexity (in a suitable sense) of the chosen family of models.

5.2 Risk bounds for model selection 
5.2.1 The main theorems 

We shall not go into the details of the construction of the selection procedure
that we use here, but content ourselves with giving the main results and analyzing their
consequences. A key idea for the construction appeared in Barron and Cover (1991).
Further approaches to selection procedures have been developed in Barron, Birgé
and Massart (1999), Birgé and Massart (1997 and 2001), van de Geer (2000), Györfi,
Kohler, Krzyżak and Walk (2002) and Massart (2006), who provides an extensive list of
references. We follow here the approach based on dimension from Birgé (2006),
providing hereafter two theorems corresponding to our two problems of interest, Gaussian
regression and density estimation. In both cases, the construction of the estimators
requires the introduction of a family of positive weights $\{\Delta_m, m\in\mathcal{M}\}$, to be chosen
by the statistician and satisfying the condition

$$\sum_{m\in\mathcal{M}}\exp[-\Delta_m] \le 1. \tag{5.3}$$

In case of equality in (5.3), the family $\{q_m\}_{m\in\mathcal{M}}$ with $q_m = \exp[-\Delta_m]$ defines a
probability $Q$ on the family of models and choosing a large value for $\Delta_m$ means
putting a small probability on the model $S_m$. One can then interpret $q_m$ as a prior probability
that the statistician puts on $S_m$ and which influences the result of the estimation
procedure, as shown by the next theorems. Such an interpretation of the weights
$\Delta_m$ corresponds to the so-called Bayesian point of view. A detailed analysis of this
interpretation can be found in Birgé and Massart (2001, Sect. 3.4).

Theorem 3 Let $X$ be a Gaussian vector in $\mathbb{R}^n$ with unknown mean $s$ and known
covariance matrix $\sigma^2 I_n$. Let $\{S_m, m\in\mathcal{M}\}$ be a finite or countable family of subsets
of $\mathbb{R}^n$ with finite metric dimensions bounded by $\overline{D}_m$, respectively. Let $\{\Delta_m, m\in\mathcal{M}\}$
be a family of positive weights satisfying (5.3). One can build an estimator $\tilde{s}(X)$ of
$s$ such that, for some universal constant $C$,

$$\mathbb{E}_s\big[\|\tilde{s} - s\|^2\big] \le C\inf_{m\in\mathcal{M}}\Big\{\sigma^2\max\{\overline{D}_m,\Delta_m\} + \inf_{t\in S_m}\|s-t\|^2\Big\}. \tag{5.4}$$

Theorem 4 Let $X = (X_1,\ldots,X_n)$ be an i.i.d. sample with unknown density $s$ with
respect to some measure $\nu$ on $(E,\mathcal{E})$ and $(M,h)$ be the metric space of all such
densities with Hellinger distance. Let $\{S_m, m\in\mathcal{M}\}$ be a finite or countable family of
subsets of $(M,h)$ with finite metric dimensions bounded by $\overline{D}_m$, respectively. Let
$\{\Delta_m, m\in\mathcal{M}\}$ be a family of positive weights satisfying (5.3). One can build an
estimator $\tilde{s}(X_1,\ldots,X_n)$ of $s$ such that, for some universal constant $C$,

$$\mathbb{E}_s\big[h^2(\tilde{s},s)\big] \le C\inf_{m\in\mathcal{M}}\Big\{n^{-1}\max\{\overline{D}_m,\Delta_m\} + \inf_{t\in S_m}h^2(s,t)\Big\}. \tag{5.5}$$

Remark: The choice of the bound 1 in (5.3) is in no way canonical and was simply
made for convenience. Any small constant would do since we did not provide the
actual value of $C$ which depends on the right-hand side of (5.3).

5.2.2 About the complexity of families of models

The only difference between the ideal bound (5.2) and (5.4) is the replacement of $D_m$
by $\max\{D_m,\Delta_m\}$ with weights $\Delta_m$ satisfying (5.3) and we see, comparing (4.16)
and (5.5), that the same difference holds for density estimation. More generally, in a
framework for which an analogue of Proposition holds, leading to (4.17), we proved
in Birgé (2006) that

$$\mathbb{E}_s\left[d^2(s,\hat{s}_{\hat{m}})\right] \le C \inf_{m\in\mathcal{M}}\left\{C_1\inf_{t\in S_m}d^2(s,t) + C_2\max\{D_m,\Delta_m\}\right\} \tag{5.6}$$

holds instead of (5.1). In all situations, apart from the constant $C$, the loss with
respect to the ideal bound is due to the replacement of $D_m$ by $\max\{D_m,\Delta_m\}$ where
the weights $\Delta_m$ satisfy (5.3). If $\Delta_m$ is not much larger than $D_m$ for all $m$, we have
almost reached the ideal risk; otherwise we have not, and we can now explain what we
mean by the complexity of a family of models.

For each positive integer $j$, let us denote by $H(j)$ the cardinality of the set $\mathcal{M}_j$ of
those $m$ such that $j/2 \le D_m < (j+1)/2$. If $H(j)$ is finite for all $j$, let us choose
$\Delta_m = (j+1)/2 + \log_+(H(j))$ for $m\in\mathcal{M}_j$, where $\log_+(x) = \log x$ for $x \ge 1$ and
$\log_+(0) = 0$. Then

$$\sum_{m\in\mathcal{M}}\exp[-\Delta_m] = \sum_{j\ge 1}\sum_{m\in\mathcal{M}_j}\exp\left[-(j+1)/2 - \log_+(H(j))\right] \le \sum_{i\ge 2}\exp[-i/2] < 1$$

and (5.3) holds. Moreover,

$$\max\{D_m,\Delta_m\} = \Delta_m \le 2D_m\left[1 + j^{-1}\log_+(H(j))\right] \quad\text{for } m\in\mathcal{M}_j.$$

If $j^{-1}\log_+[H(j)]$ is uniformly bounded and the bound is not large, then (5.6) and
(5.1) are comparable and we can consider that the family of models is not complex.
On the other hand, if, for some $j$, $\log[H(j)]$ is substantially larger than $j$, then $\Delta_m$ is
substantially larger than $D_m$, at least for some $m$, which may result in a bound
(5.6) much larger than (5.1). If $H(j) = +\infty$ for some $j$, (5.3) requires that $\Delta_m$ be
unbounded for $m\in\mathcal{M}_j$, which is even worse. A reasonable measure of the complexity
of a family of models is therefore $\sup_{j\ge 1} j^{-1}\log_+[H(j)]$, high complexity of the family
corresponding to large values of this index.
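
The construction of the weights and of the complexity index can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `weights_and_complexity` and the two example families (nested models and all subsets of $p = 10$ variables, with $D_m = |m|/2$) are hypothetical choices made for this sketch.

```python
import math
from collections import Counter
from itertools import combinations

def weights_and_complexity(dims):
    """dims maps each model m to a bound D_m >= 1/2 on its metric dimension.

    Returns the weights Delta_m = (j + 1)/2 + log_+ H(j) and the complexity
    index sup_j log_+ H(j) / j, where j is the integer with j/2 <= D_m < (j + 1)/2.
    """
    level = {m: math.floor(2 * d) for m, d in dims.items()}
    H = Counter(level.values())            # H(j) = number of models at level j (>= 1)
    weights = {m: (j + 1) / 2 + math.log(H[j]) for m, j in level.items()}
    index = max(math.log(h) / j for j, h in H.items())
    return weights, index

# Hypothetical example: p = 10 variables, D_m = |m|/2.
p = 10
nested = {tuple(range(1, q + 1)): q / 2 for q in range(1, p + 1)}
all_subsets = {m: len(m) / 2
               for q in range(1, p + 1)
               for m in combinations(range(1, p + 1), q)}

for name, family in (("nested", nested), ("all subsets", all_subsets)):
    weights, index = weights_and_complexity(family)
    kraft = sum(math.exp(-w) for w in weights.values())
    print(f"{name}: complexity index = {index:.3f}, sum exp(-Delta_m) = {kraft:.3f}")
```

In both cases the sum of $\exp[-\Delta_m]$ stays below 1, but the index is zero for the nested family and equal to $\log p$ for the family of all subsets.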

5.3 Application 1: variable selection in Gaussian regression 

Let us now give some concrete illustrations of more or less complex families of models
corresponding to the examples that motivated our investigations about model selection.
To begin with, we consider the situation of Section 3.2.1 with a large number
$p \le n$ of potentially influential explanatory variables $Z^j$ and set $\Lambda = \{1;\ldots;p\}$. For
any subset $m$ of $\Lambda$ we define $S_m$ as the linear span of the vectors $Z^j$ for $j\in m$.
According to Section 4.4.2, $S_m$ has a metric dimension bounded by $|m|/2$.

Let us assume that we have ordered the variables according to their supposed
relevance, $Z^1$ being the most relevant. In such a situation it is natural to consider
the models spanned by the $q$ most relevant variables $Z^1,\ldots,Z^q$ for $1\le q\le p$ and
therefore to set $\mathcal{M} = \mathcal{M}_1 = \{\{1;\ldots;q\}, 1\le q\le p\}$. This is not a complex family of
models and the choice $\Delta_m = |m|$ ensures that (5.3) holds. It follows from Theorem 3
that one can design an estimator $\hat{s}(X)$ satisfying

$$\mathbb{E}_s\left[\|\hat{s}-s\|^2\right] \le C \inf_{m\in\mathcal{M}_1}\left\{\sigma^2|m| + \inf_{t\in S_m}\|s-t\|^2\right\}. \tag{5.7}$$

Comparing this with the performance of the m.l.e. with respect to each model $S_m$
given by (3.4), we see that, apart from the constant $C$, we recover the performance
of the best model in the family.

This simple approach has, nevertheless, some drawbacks. First, we have to order
the explanatory variables, which is often not easy. Second, the result is really bad if we
make a serious mistake in this ordering. Imagine, for instance, that $s$ only depends
on four highly influential variables so that, if the variables had been ordered correctly,
the best model, i.e. the one minimizing $\sigma^2|m| + \inf_{t\in S_m}\|s-t\|^2$, would be $S_{\{1;2;3;4\}}$
and the corresponding risk $4\sigma^2$. If one of these four very influential variables has
been neglected and appears in the sequence with a high index $l$, it may happen that,
because of this wrong ordering, the best model becomes $S_{\{1;\ldots;l\}}$, leading to the much
higher risk $\sigma^2 l$.

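In order to avoid the difficulties connected with variable ordering, one may introduce
many more models, defining $\mathcal{M} = \mathcal{M}_2$ as the set of all nonvoid subsets $m$ of
$\Lambda$. Since the number of nonvoid subsets of $\Lambda$ with cardinality $q$ is $\binom{p}{q} \le p^q/q!$, we
may choose $\Delta_m = 1 + |m|\log p$ to get (5.3) so that, by Theorem 3, one can find an
estimator $\hat{s}(X)$ satisfying

$$\mathbb{E}_s\left[\|\hat{s}-s\|^2\right] \le C \inf_{m\in\mathcal{M}_2}\left\{\sigma^2\left(1+|m|\log p\right) + \inf_{t\in S_m}\|s-t\|^2\right\}. \tag{5.8}$$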

With this method, we avoid the problems connected with variable ordering and
may even introduce more explanatory variables than observations ($p > n$), hoping
that with so many variables at our disposal, one can find a small subset $m$ of them that
provides an accurate model for $s$. There is a price to pay for that! We now have a
complex family of models when $p$ is large, resulting in values of $\Delta_m$ which are much
larger than $|m|$, and we pay the extra factor $\log p$ in our risk bounds.
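
The claim that $\Delta_m = 1 + |m|\log p$ satisfies (5.3) over all nonvoid subsets can be checked numerically. The sketch below is illustrative only; the function name `kraft_sum_all_subsets` is a hypothetical helper, and the binomial coefficients are evaluated in log scale through `math.lgamma` to avoid overflow for large $p$.

```python
import math

def kraft_sum_all_subsets(p):
    """Sum of exp(-Delta_m) over all nonvoid subsets m of {1,...,p},
    with Delta_m = 1 + |m| log p. Subsets of the same size q share one weight."""
    total = 0.0
    for q in range(1, p + 1):
        log_count = math.lgamma(p + 1) - math.lgamma(q + 1) - math.lgamma(p - q + 1)
        total += math.exp(log_count - 1.0 - q * math.log(p))
    return total

for p in (2, 10, 100, 1000):
    # The sum equals exp(-1) * ((1 + 1/p)^p - 1), which is below 1 - 1/e.
    print(p, round(kraft_sum_all_subsets(p), 4))
```

The sum stays bounded whatever $p$, while the ratio $\Delta_m/|m|$ grows like $\log p$: this is the price of the complexity of the family.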

One can actually combine the advantages of the two approaches by mixing the two
families in the following way. We first order the $p$ variables as we did at the beginning,
giving the smallest indices to the variables we believe are more influential, and set again
$\mathcal{M} = \mathcal{M}_2$. We then fix $\Delta_m = |m| + 1/2$ for $m\in\mathcal{M}_1$ and $\Delta_m = 1 + |m|\log p$ for
$m\in\mathcal{M}\setminus\mathcal{M}_1$ so that (5.3) still holds. Theorem 3 shows that

$$\mathbb{E}_s\left[\|\hat{s}-s\|^2\right] \le C \inf_{m\in\mathcal{M}}\left\{\sigma^2\Delta_m + \inf_{t\in S_m}\|s-t\|^2\right\}.$$

If our ordering of the variables is right, the best $m$ belongs to $\mathcal{M}_1$ and we get an
analogue of (5.7). If not, we lose a factor $\log p$ from the risk of the best model as in (5.8).
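
That the mixed weights still satisfy (5.3) can also be verified numerically. The following is a minimal sketch; `kraft_sum_mixed` is a hypothetical helper name, and the nested models $\{1;\ldots;q\}$ are assumed to receive the weight $|m| + 1/2$ while every other nonvoid subset receives $1 + |m|\log p$.

```python
import math

def kraft_sum_mixed(p):
    """Sum of exp(-Delta_m) over all nonvoid subsets of {1,...,p} with mixed weights."""
    nested = sum(math.exp(-(q + 0.5)) for q in range(1, p + 1))
    others = 0.0
    for q in range(1, p + 1):
        log_count = math.lgamma(p + 1) - math.lgamma(q + 1) - math.lgamma(p - q + 1)
        unit = math.exp(-1.0 - q * math.log(p))          # exp(-Delta_m) for one such subset
        # binom(p, q) - 1 subsets of size q are not of the nested form {1,...,q}
        others += math.exp(log_count - 1.0 - q * math.log(p)) - unit
    return nested + others

for p in (2, 5, 50, 500):
    print(p, round(kraft_sum_mixed(p), 4))
```

The nested part is bounded by $e^{-1/2}/(e-1)$ and the remaining part by $1 - e^{-1}$, so the total remains below 1.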



5.4 Application 2: histograms and density estimation 

5.4.1 Problems connected with the use of the L2-distance in density estimation

Let us now come back to density estimation with histograms. In Section 1.2 we
used the L2-distance to measure the distortion between $s$ and its estimator. This is
certainly the most popular and most widely studied measure of distortion for density
estimation but it actually has some serious drawbacks, as shown by Devroye and
Györfi (1985). For histograms it results in risk bounds (1.10) depending on $\|s\|_\infty$ for
irregular partitions, which are not of the form

$$R(\hat{s}_m,s) \le C\left[\|s-\bar{s}_m\|^2 + n^{-1}|m|\right]$$

for some universal constant $C$, independent of $s$, $n$ and the partition $m$. It is actually
impossible to get an analogue of Theorem 4 where the L2-distance would replace the
Hellinger distance, as shown by the following proposition motivated by Theorem 2.1
of Rigollet and Tsybakov (2005). Indeed, if such a theorem were true, we could apply
it to the model $S$ provided by this proposition and conclude that the corresponding
estimator would satisfy the analogue of (4.16), leading to the uniform risk bound

$$\mathbb{E}_s\left[\|\hat{s}-s\|^2\right] \le CD/(2n) \quad\text{for all } s\in S$$

and some universal constant $C$, therefore independent of $L$. This would clearly
contradict (5.9) below for large enough values of $L$.

Proposition 4 For each $L > 0$ and each integer $D$ with $1\le D\le 3n$, one can find
a finite set $S$ of densities with the following properties:

i) it is a subset of some $D$-dimensional affine subspace of $L_2([0,1],dx)$ with a metric
dimension bounded by $D/2$;

ii) $\sup_{s\in S}\|s\|_\infty \le L+1$;

iii) for any estimator $\hat{s}(X_1,\ldots,X_n)$ belonging to $L_2([0,1],dx)$ and based on an i.i.d.
sample with density $s\in S$,

$$\sup_{s\in S}\mathbb{E}_s\left[\|\hat{s}-s\|^2\right] > 0.0139\,DLn^{-1}. \tag{5.9}$$

Proof: Let us set $a = D/(4n) \le 3/4$, define $\theta$ by $(1-\theta)/\theta = 4nL/D$ and introduce
the functions

$$f(x) = 1_{[0,1[}(x) \quad\text{and}\quad g(x) = -a\,1_{[0,(1-\theta)/D]}(x) + a(1-\theta)\theta^{-1}\,1_{](1-\theta)/D,\,1/D[}(x).$$

Then $\int_0^1 g(x)\,dx = 0$, $\sup_x g(x) = L$, $\inf_x g(x) = -a \ge -3/4$ and

$$\|g\|^2 = \int_0^{1/D} g^2(x)\,dx = a^2\,\frac{1-\theta}{D}\left[1+\frac{1-\theta}{\theta}\right] = \frac{a^2(1-\theta)}{\theta D} = \frac{L}{4n}. \tag{5.10}$$

It follows that $\|f-(f+g)\|^2 = L/(4n)$. Moreover

$$h^2(f,f+g) = \frac{1}{2}\int_0^{1/D}\left(1-\sqrt{1+g(x)}\right)^2 dx = \frac{1}{2}\int_0^{1/D}\left[2+g(x)-2\sqrt{1+g(x)}\right]dx = \frac{1}{D} - \int_0^{1/D}\sqrt{1+g(x)}\,dx$$

$$= D^{-1}\left[1-(1-\theta)\sqrt{1-a}-\theta\sqrt{1+a(1-\theta)\theta^{-1}}\right] \le D^{-1}(2a/3) = (6n)^{-1}, \tag{5.11}$$

since $a \le 3/4$. Let us now set, for $1\le j\le D$, $g_j(x) = g\left(x - D^{-1}(j-1)\right)$, so that
these $D$ translates of $g$ have disjoint supports and $g_1 = g$. Let $\mathcal{D} = \{0;1\}^D$ with the
distance $\Delta$ given by $\Delta(\delta,\delta') = \sum_{j=1}^D|\delta_j-\delta'_j|$. For each $\delta\in\mathcal{D}$ we consider the density
$s_\delta(x) = f(x) + \sum_{j=1}^D\delta_j g_j(x)$ and set $S = \{s_\delta, \delta\in\mathcal{D}\}$. Clearly $\|s_\delta\|_\infty \le L+1$ for all
$\delta\in\mathcal{D}$ and it follows from (5.10) that

$$\|s_\delta - s_{\delta'}\|^2 = \sum_{j=1}^D(\delta_j-\delta'_j)^2\int_0^1 g_j^2(x)\,dx = \frac{L}{4n}\,\Delta(\delta,\delta'). \tag{5.12}$$

Moreover, since $S$ is a subset of some $D$-dimensional affine subspace of $L_2([0,1],dx)$,
it follows from the arguments used in the proof of Proposition 2 that its metric
dimension is bounded by $D/2$.

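Defining $P_\delta$ by $dP_\delta/dx = s_\delta$, we derive from (5.11) that $h^2(P_\delta,P_{\delta'}) \le (6n)^{-1}$, hence
$\rho(P_\delta,P_{\delta'}) \ge \bar{\rho} = 1-(6n)^{-1}$, for each pair $(\delta,\delta')\in\mathcal{D}^2$ such that $\Delta(\delta,\delta') = 1$. We
may then apply Assouad's Lemma below to conclude from (5.12) that, whatever the
estimator $\hat{\delta}$ with values in $\mathcal{D}$,

$$\sup_{\delta\in\mathcal{D}}\mathbb{E}_{s_\delta}\left[\|s_{\hat{\delta}}-s_\delta\|^2\right] = \frac{L}{4n}\sup_{\delta\in\mathcal{D}}\mathbb{E}_{s_\delta}\left[\Delta(\hat{\delta},\delta)\right] \ge \frac{L}{4n}\,\frac{D}{2}\left[1-\sqrt{1-\left[1-(6n)^{-1}\right]^{2n}}\right].$$

Let $\hat{s}$ be any density estimator based on $X_1,\ldots,X_n$ and set $\hat{\delta}(X_1,\ldots,X_n)$ to satisfy
$\|\hat{s}-s_{\hat{\delta}}\| = \inf_{\delta\in\mathcal{D}}\|\hat{s}-s_\delta\|$ so that, whatever $\delta\in\mathcal{D}$, $\|s_{\hat{\delta}}-s_\delta\| \le 2\|\hat{s}-s_\delta\|$. We derive
from our last bound that

$$\sup_{\delta\in\mathcal{D}}\mathbb{E}_{s_\delta}\left[\|\hat{s}-s_\delta\|^2\right] \ge \frac{1}{4}\sup_{\delta\in\mathcal{D}}\mathbb{E}_{s_\delta}\left[\|s_{\hat{\delta}}-s_\delta\|^2\right] \ge \frac{LD}{32n}\left[1-\sqrt{1-\left[1-(6n)^{-1}\right]^{2n}}\right].$$

We conclude by observing that $\left[1-(6n)^{-1}\right]^{2n}$ is increasing with $n$, hence $\ge 25/36$.
□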
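
The elementary computations of this proof lend themselves to a numerical sanity check. The sketch below is illustrative only: the function name `proposition4_quantities` and the values of $n$, $D$ and $L$ are arbitrary choices made for this example, and $g$ is treated as the two-valued step function described in the proof.

```python
import math

def proposition4_quantities(n, D, L):
    """Integral, squared L2-norm and Hellinger term for the perturbation g of the proof."""
    a = D / (4 * n)                            # requires D <= 3n, so that a <= 3/4
    theta = 1 / (1 + 4 * n * L / D)            # solves (1 - theta)/theta = 4nL/D
    low, high = -a, a * (1 - theta) / theta    # the two values taken by g
    len_low, len_high = (1 - theta) / D, theta / D
    integral = low * len_low + high * len_high
    sq_norm = low**2 * len_low + high**2 * len_high
    hellinger = 1 / D - (math.sqrt(1 + low) * len_low + math.sqrt(1 + high) * len_high)
    return {"sup_g": high, "integral": integral, "sq_norm": sq_norm, "hellinger": hellinger}

n, D, L = 100, 30, 5.0                         # hypothetical values with D <= 3n
q = proposition4_quantities(n, D, L)
print(q["sup_g"], L)                           # sup g should equal L
print(q["sq_norm"], L / (4 * n))               # (5.10)
print(q["hellinger"], 1 / (6 * n))             # (5.11): first value below the second

# Constant of (5.9): (1/32) * [1 - sqrt(1 - 25/36)]
constant = (1 - math.sqrt(1 - 25 / 36)) / 32
print(constant)
```

The last printed value is the constant obtained for $n = 1$, the least favourable case, which is why (5.9) holds with 0.0139 for all $n$.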

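Lemma 2 (Assouad, 1983) Let $\{P_\delta, \delta\in\mathcal{D}\}$ be a family of distributions indexed
by $\mathcal{D} = \{0;1\}^D$ and $X_1,\ldots,X_n$ an i.i.d. sample from a distribution in the family.
Assume that $\rho(P_\delta,P_{\delta'}) \ge \bar{\rho}$ for each pair $(\delta,\delta')\in\mathcal{D}^2$ such that $\Delta(\delta,\delta') = 1$. Then for
any estimator $\hat{\delta}(X_1,\ldots,X_n)$ with values in $\mathcal{D}$,

$$\sup_{\delta\in\mathcal{D}}\mathbb{E}_\delta\left[\Delta\left(\hat{\delta}(X_1,\ldots,X_n),\delta\right)\right] \ge \frac{D}{2}\left[1-\sqrt{1-\bar{\rho}^{2n}}\right] \ge \frac{D\bar{\rho}^{2n}}{4}, \tag{5.13}$$

where $\mathbb{E}_\delta$ denotes the expectation when the $X_i$ have the distribution $P_\delta$.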



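Proof: Let us set $P_\delta^n$ for the joint distribution of the $X_i$ with individual distribution
$P_\delta$ and consider some measure $\mu$ which dominates the probabilities $P_\delta^n$ for $\delta\in\mathcal{D}$.
First note that the left-hand side of (5.13) is at least as large as the average risk

$$R_B = 2^{-D}\sum_{\delta\in\mathcal{D}}\mathbb{E}_\delta\left[\Delta(\hat{\delta},\delta)\right] = 2^{-D}\sum_{\delta\in\mathcal{D}}\sum_{k=1}^D\int|\hat{\delta}_k-\delta_k|\,dP_\delta^n.$$

Then, setting $Q_k^j = 2^{-D+1}\sum_{\{\delta\in\mathcal{D}\,|\,\delta_k=j\}}P_\delta^n$ with $j = 0$ or 1, we get

$$R_B = 2^{-D}\sum_{k=1}^D\left(\sum_{\{\delta\in\mathcal{D}\,|\,\delta_k=0\}}\int\hat{\delta}_k\,dP_\delta^n + \sum_{\{\delta\in\mathcal{D}\,|\,\delta_k=1\}}\int(1-\hat{\delta}_k)\,dP_\delta^n\right)$$

$$= \frac{1}{2}\sum_{k=1}^D\int\left[\hat{\delta}_k\frac{dQ_k^0}{d\mu} + (1-\hat{\delta}_k)\frac{dQ_k^1}{d\mu}\right]d\mu \ge \frac{1}{2}\sum_{k=1}^D\int\inf\left\{\frac{dQ_k^0}{d\mu};\frac{dQ_k^1}{d\mu}\right\}d\mu.$$

Since $\inf\{x;y\}$ is a concave function of the pair $(x,y)$, it follows that

$$\inf\left\{\frac{dQ_k^0}{d\mu};\frac{dQ_k^1}{d\mu}\right\} \ge 2^{-D+1}\sum_{(\delta,\delta')\in\mathcal{D}_k}\inf\left\{\frac{dP_\delta^n}{d\mu};\frac{dP_{\delta'}^n}{d\mu}\right\}$$

with $\mathcal{D}_k = \{(\delta,\delta')\,|\,\delta_k = 0, \delta'_k = 1, \delta_j = \delta'_j \text{ for } j\ne k\}$, hence

$$R_B \ge 2^{-D}\sum_{k=1}^D\sum_{(\delta,\delta')\in\mathcal{D}_k}\int\inf\left\{\frac{dP_\delta^n}{d\mu};\frac{dP_{\delta'}^n}{d\mu}\right\}d\mu.$$

We now use (4.8) to conclude that

$$R_B \ge 2^{-D}\sum_{k=1}^D\sum_{(\delta,\delta')\in\mathcal{D}_k}\left[1-\sqrt{1-\rho^2\left(P_\delta^n,P_{\delta'}^n\right)}\right].$$

By assumption, $\rho(P_\delta,P_{\delta'}) \ge \bar{\rho}$ for $(\delta,\delta')\in\mathcal{D}_k$, hence $\rho^2(P_\delta^n,P_{\delta'}^n) = \rho^{2n}(P_\delta,P_{\delta'}) \ge
\bar{\rho}^{2n}$. Since each $\mathcal{D}_k$ contains $2^{D-1}$ pairs, the conclusion follows. □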



5.4.2 Partition selection for histograms 

If we use the Hellinger distance instead of the L2-distance to evaluate the risk of
histograms, we can improve (1.10), getting a universal bound which does not involve
$\|s\|_\infty$. We recall that $S_m$ is the set of densities which are constant on the elements of
the partition $m$.

Theorem 5 Let $s$ be some density with respect to the Lebesgue measure on $[0,1]$,
$X_1,\ldots,X_n$ be an $n$-sample from the corresponding distribution and $m = \{I_0,\ldots,I_D\}$
be a partition of $[0,1]$ into intervals $I_j$ with respective lengths $|I_j|$. Let $\hat{s}_m$ be the
histogram estimator based on this partition and given by

$$\hat{s}_m = \sum_{j=0}^D\frac{1}{n|I_j|}\left(\sum_{i=1}^n 1_{I_j}(X_i)\right)1_{I_j}.$$

The Hellinger risk of $\hat{s}_m$ is bounded by

$$\mathbb{E}_s\left[h^2(s,\hat{s}_m)\right] \le 2\inf_{t\in S_m}h^2(s,t) + D/(2n). \tag{5.14}$$



Proof: It is shown in Birgé and Rozenholc (2006) that

$$\mathbb{E}_s\left[h^2(s,\hat{s}_m)\right] \le h^2(s,\bar{s}_m) + \frac{D}{2n} \quad\text{with}\quad \bar{s}_m = \sum_{j=0}^D\left[\frac{1}{|I_j|}\int_{I_j}s(x)\,dx\right]1_{I_j}.$$

Let $f$ be the L2-orthogonal projection of $\sqrt{s}$ onto the linear span $V_m$ of $1_{I_0},\ldots,1_{I_D}$.
Then

$$f = \sum_{j=0}^D\left[\frac{1}{|I_j|}\int_{I_j}\sqrt{s(x)}\,dx\right]1_{I_j} \quad\text{and}\quad \|f-\sqrt{s}\|^2 \le 2h^2(s,t) \text{ for all } t\in S_m.$$

Setting $\bar{s}_m = \sum_{j=0}^D a_j1_{I_j}$ and $f = \sum_{j=0}^D b_j1_{I_j}$, we get from Jensen's Inequality that
$b_j \le \sqrt{a_j}$. It follows that

$$h^2(s,\bar{s}_m) = 1-\sum_{j=0}^D\int_{I_j}\sqrt{a_j s(x)}\,dx = 1-\sum_{j=0}^D\sqrt{a_j}\,b_j|I_j| \le 1-\sum_{j=0}^D b_j^2|I_j|$$

while

$$\|f-\sqrt{s}\|^2 = 1+\sum_{j=0}^D\int_{I_j}b_j^2\,dx - 2\sum_{j=0}^D b_j\int_{I_j}\sqrt{s(x)}\,dx = 1-\sum_{j=0}^D b_j^2|I_j|.$$

Hence

$$h^2(s,\bar{s}_m) \le \|f-\sqrt{s}\|^2 \le 2\inf_{t\in S_m}h^2(s,t). \tag{5.15}$$

□
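
The two quantities compared in this proof can be computed exactly for a simple density. The sketch below is only an illustration: it assumes the density $s(x) = 2x$ on $[0,1]$, chosen because its integrals are explicit, and the function name `check_theorem5_inequalities` is hypothetical.

```python
import math

def check_theorem5_inequalities(endpoints):
    """endpoints: increasing list 0 = y_0 < ... < y_{D+1} = 1.
    Uses the illustrative density s(x) = 2x on [0, 1] (an assumption of this sketch).
    Returns h^2(s, bar s_m) and ||f - sqrt(s)||^2."""
    rho = 0.0      # sum_j sqrt(a_j) b_j |I_j|, the Hellinger affinity of s and bar s_m
    proj = 0.0     # sum_j b_j^2 |I_j|
    for lo, hi in zip(endpoints, endpoints[1:]):
        length = hi - lo
        a = (hi**2 - lo**2) / length                                 # mean of s on I_j
        b = math.sqrt(2) * (2 / 3) * (hi**1.5 - lo**1.5) / length    # mean of sqrt(s) on I_j
        assert b <= math.sqrt(a) + 1e-12                             # Jensen's inequality
        rho += math.sqrt(a) * b * length
        proj += b * b * length
    return 1 - rho, 1 - proj

for D in (1, 3, 7):
    regular = [j / (D + 1) for j in range(D + 2)]
    h2, d2 = check_theorem5_inequalities(regular)
    print(D + 1, h2, d2)       # h2 should not exceed d2
```

On each partition the first value is bounded by the second, as in the first inequality of (5.15), and both decrease when the regular partition is refined.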

If, in particular, $\sqrt{s}$ is Hölder continuous and satisfies (1.13), we derive as in
Section 1 that one can find a regular partition $m$, depending on $L$ and $\beta$, such that

$$\mathbb{E}_s\left[h^2(s,\hat{s}_m)\right] \le \max\left\{(5/2)\left(Ln^{-\beta}\right)^{2/(2\beta+1)};\,n^{-1}\right\}. \tag{5.16}$$

Then, a useful remark is as follows. If we have at our disposal a sample $X_1,\ldots,X_{2n}$ of
size $2n$ and a family $\mathcal{M}$ of partitions of $[0,1]$, one can use the first half of the sample
to build the corresponding histograms $\hat{s}_m(X_1,\ldots,X_n)$ and use the second half of
the sample to select one estimator in the family. For this, we merely have to apply
Theorem 4 to the sample $X_{n+1},\ldots,X_{2n}$ conditionally on $X_1,\ldots,X_n$. Conditionally
on $X_1,\ldots,X_n$, each histogram $\hat{s}_m$ is simply a density which can be considered as a
model $S_m$ containing only one point, hence with a finite metric dimension bounded
by 1/2. Let $\{\Delta_m, m\in\mathcal{M}\}$ be a family of weights satisfying

$$\sum_{m\in\mathcal{M}}\exp[-\Delta_m] \le 1 \quad\text{and}\quad \Delta_m \ge 1 \text{ for all } m. \tag{5.17}$$

We derive from Theorem 4 applied to the models $S_m = \{\hat{s}_m\}$ that there exists an
estimator $\hat{s}(X_1,\ldots,X_{2n})$ such that

$$\mathbb{E}_s\left[h^2(s,\hat{s})\,|\,X_1,\ldots,X_n\right] \le C\inf_{m\in\mathcal{M}}\left\{n^{-1}\Delta_m + h^2\left(s,\hat{s}_m(X_1,\ldots,X_n)\right)\right\}.$$

Integrating with respect to $X_1,\ldots,X_n$ and using (5.14) finally leads to

$$\mathbb{E}_s\left[h^2(s,\hat{s})\right] \le C\inf_{m\in\mathcal{M}}\left\{n^{-1}\Delta_m + 2\inf_{t\in S_m}h^2(s,t) + (|m|-1)/(2n)\right\}$$

$$\le C'\inf_{m\in\mathcal{M}}\left\{n^{-1}\max\{|m|,\Delta_m\} + \inf_{t\in S_m}h^2(s,t)\right\}. \tag{5.18}$$

5.4.3 A straightforward application of partition selection

To give a concrete application of this result, let us introduce some special classes of
partitions. For any finite partition $m = \{I_0,\ldots,I_D\}$ into intervals, we denote by
$\Lambda_m$ the set $\{y_0 < \ldots < y_{D+1}\}$, $y_0 = 0$, $y_{D+1} = 1$, of endpoints of the intervals $I_j$.
Introducing, for $k\ge 1$, the set $J_k$ of dyadic numbers $\{j2^{-k}, 0\le j\le 2^k\}$, we denote
by $\mathcal{M}_{D,k}$, for $1\le D < 2^k$, the set of those partitions $m$ which satisfy

$$|m| = D+1; \quad \Lambda_m\subset J_k \quad\text{and}\quad \Lambda_m\not\subset J_{k-1}.$$

Denoting by $m_0$ the trivial partition with one element $[0,1]$, we define $\mathcal{M}$ by

$$\mathcal{M} = \{m_0\}\cup\left(\bigcup_{k\ge 1}\,\bigcup_{1\le D<2^k}\mathcal{M}_{D,k}\right).$$

The partitions in $\mathcal{M}$ are dense in the set of finite partitions into intervals in the
following sense: given any such partition $m$, an element $t$ in $S_m$, as defined by (1.11),
and $\varepsilon > 0$, we can find $m'\in\mathcal{M}$ and $t'\in S_{m'}$ such that $h(t,t') \le \varepsilon$. This means
that the approximation properties of $\bigcup_{m\in\mathcal{M}}S_m$ are the same as those of all possible
histograms. Since $|\mathcal{M}_{D,k}| \le \binom{2^k-1}{D} < 2^{kD}$, if we set $\Delta_m = \Delta'_m = [(k+1)(D+1)+1]\log 2$
for $m\in\mathcal{M}_{D,k}$ and $\Delta_{m_0} = 1$, we get

$$\sum_{k\ge 1}\sum_{1\le D<2^k}\sum_{m\in\mathcal{M}_{D,k}}e^{-\Delta'_m} \le \sum_{k\ge 1}\sum_{1\le D<2^k}2^{kD-(k+1)(D+1)-1} < \frac{1}{4}\sum_{k\ge 1}2^{-k}\sum_{D\ge 1}2^{-D} = \frac{1}{4}.$$

It follows that (5.17) holds so that, by (5.18), one can find an estimator $\hat{s}(X_1,\ldots,X_{2n})$
which satisfies

$$\mathbb{E}_s\left[h^2(\hat{s},s)\right] \le C\inf_{k\ge 1}\,\inf_{1\le D<2^k}\,\inf_{m\in\mathcal{M}_{D,k}}\left\{\frac{kD}{n} + \inf_{t\in S_m}h^2(s,t)\right\}. \tag{5.19}$$

If, in the right-hand side of (5.19), we set $m$ to be the regular partition with $2^k$
elements, which belongs to $\mathcal{M}_{2^k-1,k}$, we get a bound of the form

$$\mathbb{E}_s\left[h^2(\hat{s},s)\right] \le C\left[k2^kn^{-1} + \inf_{t\in S_m}h^2(s,t)\right].$$

For densities $s$ with $\sqrt{s}$ satisfying (1.13), we get

$$\mathbb{E}_s\left[h^2(\hat{s},s)\right] \le C\inf_{k\ge 1}\left\{k2^kn^{-1} + L^22^{-2k\beta}\right\},$$

but an optimization with respect to $k$ does not allow us to recover the bound (5.16)
because of an extra factor $\log(nL^2)$. This factor is connected with the complexity
of the families $\mathcal{M}_{D,k}$, which forces us to fix $\Delta_m$ much larger than $|m| = D+1$ for
most elements of $\mathcal{M}_{D,k}$ when $k$ is large. Most, but not all! It is in particular easy
to modify the value of $\Delta_m$ for the regular partitions without violating (5.17). If $m_k$
denotes the regular partition with $2^k$ elements and $\mathcal{M}_R$ the set of such partitions, we
may choose $\Delta_{m_k} = 2^k$ instead of $\Delta'_{m_k}$ so that

$$\sum_{m\in\mathcal{M}_R}e^{-\Delta_m} = \sum_{k\ge 0}e^{-2^k} < 0.522$$

and (5.17) still holds. It is easy to check that, with this new choice of the weights for
the regular partitions, we improve the estimation for those densities such that $\sqrt{s}$ is
Hölder continuous. In particular, if $\sqrt{s}$ satisfies (1.13) for some unknown values of $L$
and $\beta$,

$$\mathbb{E}_s\left[h^2(\hat{s},s)\right] \le C\max\left\{\left(Ln^{-\beta}\right)^{2/(2\beta+1)};\,n^{-1}\right\}, \tag{5.20}$$

which is comparable to (5.16) although $L$ and $\beta$ are unknown, the only loss being at
the level of the constant $C$.
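
The two weight sums of this section can be evaluated numerically. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the exact cardinality of $\mathcal{M}_{D,k}$ is taken to be the number of ways of choosing $D$ interior endpoints in $J_k$ that are not all in $J_{k-1}$, and the sum over $k$ is truncated at `k_max`.

```python
import math
from fractions import Fraction

def count_dyadic_partitions(D, k):
    """|M_{D,k}|: D interior endpoints chosen in J_k, not all of them in J_{k-1}."""
    return math.comb(2**k - 1, D) - math.comb(2**(k - 1) - 1, D)

def irregular_sum(k_max):
    """Exact sum of exp(-Delta'_m) over M_{D,k} for k <= k_max, using
    exp(-Delta'_m) = 2^{-[(k + 1)(D + 1) + 1]}."""
    total = Fraction(0)
    for k in range(1, k_max + 1):
        for D in range(1, 2**k):
            total += Fraction(count_dyadic_partitions(D, k), 2**((k + 1) * (D + 1) + 1))
    return float(total)

def regular_sum(k_max=20):
    """Sum of exp(-2^k) over the regular partitions with 2^k elements, k <= k_max."""
    return sum(math.exp(-(2.0**k)) for k in range(k_max + 1))

print(irregular_sum(6))     # stays below 1/4
print(regular_sum())        # stays below 0.522
```

Exact rational arithmetic is used for the first sum because the binomial counts become very large while the weights become very small; the terms left out by the truncation are bounded by $2^{-k_{\max}}/4$.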

5.4.4 Introducing more sophisticated Approximation Theory 

The consequence of the previous modification of the weights for partitions in $\mathcal{M}_R$
is a simple illustration of the use of elementary Approximation Theory to improve
the estimation of smooth densities. One can actually do much better with the use
of more sophisticated Approximation Theory. In a milestone paper, Birman and
Solomjak (1967) introduced a family $\mathcal{M}_T$ of partitions of the cube $[0,1]^k$ which are
such that piecewise constant (and more generally piecewise polynomial) functions based on
the partitions in the family have excellent approximation properties with respect
to functions in Sobolev spaces (and functions of bounded variation when $k = 1$).
Moreover, Birman and Solomjak provide a control on the number of such partitions
with a given cardinality. For the case $k = 1$, which is the one we deal with here, the
number of elements $m$ of $\mathcal{M}_T$ with $|m| = D$ is bounded by $4^D$, which allows us to set
$\Delta_m = 2D$ for those partitions.

The algorithm leading to the construction of the partitions in $\mathcal{M}_T$, which is called
an "adaptive approximation algorithm", is also described in Section 3.3 of DeVore
(1998) and it works as follows. We choose a positive threshold $\varepsilon$ and some nonnegative
functional $J(f,I)$ depending on the function $f$ to be approximated and the
interval $I$. Roughly speaking, the functional measures the quality of approximation
of $f$ by a piecewise constant (or more generally a piecewise polynomial) function on
$I$. At step one, the algorithm starts with the trivial partition $m^1 = m_0$ with one
single interval. At step $j$ it provides a partition $m^j$ into $j$ intervals and it checks
whether $\sup_{I\in m^j}J(f,I) \le \varepsilon$ or not. If this is the case, the algorithm stops; if not, we
choose one of the intervals $I$ for which the criterion $J(f,I) \le \varepsilon$ is violated and divide
it into two intervals of equal length to derive $m^{j+1}$. Then we iterate the procedure.
For the functions $f$ of interest, which satisfy some smoothness condition related to
the functional $J$, the procedure necessarily stops at some stage, leading to a final
partition $m$. Let $\mathcal{M}_T$ be the set of all the partitions that can be obtained in this
way. Then $\mathcal{M}_R\subset\mathcal{M}_T$. Building a partition $m$ in $\mathcal{M}_T$ is actually equivalent to
growing a complete binary tree for which the initial interval $[0,1]$ corresponds to the
root of the tree, each node of the tree to an interval and each split of an interval
to adding two sons to a terminal node of the tree, the partition $m$ being in one-to-one
correspondence with the set of terminal nodes of the tree. When viewed as a tree
algorithm, this construction is similar to the CART algorithm of Breiman, Friedman,
Olshen and Stone (1984). The analysis of CART from the model selection point of
view that we explain here has been made by Gey and Nédélec (2005).
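
The splitting scheme just described can be sketched as follows. This is a minimal illustration, not the algorithm of Birman and Solomjak or DeVore: the functional `local_error` (a quadrature approximation of the squared L2 error of the best constant on the interval) is a hypothetical choice of $J(f,I)$, the interval split at each step is the one with the largest value of $J$, and `max_intervals` is a safeguard added for this sketch.

```python
import math

def local_error(f, a, b, grid=64):
    """Hypothetical choice of J(f, I) for I = [a, b): squared L2 error of the
    best constant approximation, computed with a midpoint rule on `grid` points."""
    h = (b - a) / grid
    values = [f(a + (i + 0.5) * h) for i in range(grid)]
    mean = sum(values) / grid
    return h * sum((v - mean) ** 2 for v in values)

def adaptive_partition(f, eps, J=local_error, max_intervals=1024):
    """Bisect an interval violating J(f, I) <= eps until none is left."""
    intervals = [(0.0, 1.0)]                  # step one: the trivial partition m_0
    while len(intervals) < max_intervals:
        errors = [J(f, a, b) for a, b in intervals]
        worst = max(range(len(intervals)), key=errors.__getitem__)
        if errors[worst] <= eps:              # sup_I J(f, I) <= eps: stop
            break
        a, b = intervals[worst]
        mid = (a + b) / 2
        intervals[worst:worst + 1] = [(a, mid), (mid, b)]   # two halves of equal length
    return intervals

partition = adaptive_partition(math.sqrt, eps=1e-4)
print(len(partition), partition[0], partition[-1])
```

For a function such as the square root, which is irregular near 0, the resulting intervals are shorter near the origin than near 1, which a regular partition cannot achieve.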

It follows from the correspondence between the partitions in $\mathcal{M}_T$ and the complete
binary trees that the number of elements $m$ of $\mathcal{M}_T$ such that $|m| = j+1$, $j\in\mathbb{N}$,
is equal to the number of complete binary trees with $j+1$ terminal nodes, which is
given by the Catalan numbers $(j+1)^{-1}\binom{2j}{j}$. Setting $\Delta''_m = 2|m|$ for $m\in\mathcal{M}_T$
and using $\binom{2j}{j} \le 4^j$, which follows from Stirling's expansion, we derive that

$$\sum_{m\in\mathcal{M}_T}e^{-\Delta''_m} \le \sum_{j\ge 0}\frac{e^{-2(j+1)}}{j+1}\binom{2j}{j} \le \sum_{j\ge 0}\frac{4^je^{-2(j+1)}}{j+1} < \frac{1}{4}.$$
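
The bound on this sum is easy to confirm numerically. The sketch below is illustrative; the helper names are hypothetical and the series is truncated at `j_max`, which is harmless since its terms decay geometrically.

```python
import math

def catalan(j):
    """Number of complete binary trees with j + 1 terminal nodes."""
    return math.comb(2 * j, j) // (j + 1)

def tree_weight_sum(j_max=200):
    """Upper bound for the sum of exp(-2|m|) over the tree partitions with |m| <= j_max + 1."""
    return sum(catalan(j) * math.exp(-2.0 * (j + 1)) for j in range(j_max + 1))

x = math.exp(-2.0)
# Closed form from the generating function of the Catalan numbers, valid for 4x < 1.
closed_form = x * (1 - math.sqrt(1 - 4 * x)) / (2 * x)
print(tree_weight_sum(), closed_form)
```

Both values agree and lie well below 1/4, so the weights $2|m|$ leave room for the weights of the partitions outside this class.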

It follows that (5.17) holds if we set $\Delta_m = \Delta''_m$ for $m\in\mathcal{M}_T$ and $\Delta_m = \Delta'_m$ for
$m\in\mathcal{M}\setminus\mathcal{M}_T$ and we then derive from (5.18) that not only (5.19) still holds but also

$$\mathbb{E}_s\left[h^2(\hat{s},s)\right] \le C\inf_{m\in\mathcal{M}_T}\left\{n^{-1}|m| + \inf_{t\in S_m}h^2(s,t)\right\},$$

which is indeed a substantial improvement over (5.19). In particular, since $\mathcal{M}_T$
contains $\mathcal{M}_R$, (5.20) still holds when $\sqrt{s}$ is Hölderian, but the introduction of the
much larger class $\mathcal{M}_T$ leads to a much more powerful result which follows from the
approximation properties of functions in $V_m$ given by (1.2) with $m\in\mathcal{M}_T$. We refer
the reader to the book by DeVore and Lorentz (1993) for the precise definitions of
Besov spaces and semi-norms and the variation $\mathrm{Var}^*$ in the following theorem.

Theorem 6 Let $\mathcal{M}_T$ be the set of partitions $m$ of $[0,1]$ previously defined. For any
$p > 0$, $\alpha$ with $1 > \alpha > (1/p-1/2)_+$, any positive integer $j$ and any function $t$
belonging to the Besov space $B^\alpha_{p,\infty}([0,1])$ with Besov semi-norm $|t|_{B^\alpha_{p,\infty}}$, one can find
some $m\in\mathcal{M}_T$ with $|m| = j$ and some $t'\in V_m$ such that

$$\|t-t'\|_2 \le C(\alpha,p)\,|t|_{B^\alpha_{p,\infty}}\,j^{-\alpha}, \tag{5.21}$$

where $\|\cdot\|_2$ denotes the $L_2(dx)$-norm on $[0,1]$.

If $t$ is a function of bounded variation on $[0,1]$, there exists $m\in\mathcal{M}_T$ with $|m| = j$
and $t'\in V_m$ such that $\|t-t'\|_2 \le C\,\mathrm{Var}^*(t)\,j^{-1}$.

The bound (5.21) is given in DeVore and Yu (1990). The proof for the bounded
variation case has been kindly communicated to the author by Ron DeVore.

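Applying the previous theorem to $t = \sqrt{s}$, we may always choose for $t'$ the projection
of $\sqrt{s}$ onto $V_m$ and it follows from (5.15) that the result still holds with
$t' = \sqrt{\bar{s}_m}$. In particular, if $\sqrt{s}\in B^\alpha_{p,\infty}([0,1])$, then for a suitable $m$ with $|m| = j$,
$h^2(s,\bar{s}_m) \le C(\alpha,p)\,|t|^2_{B^\alpha_{p,\infty}}\,j^{-2\alpha}$. Putting this into (5.18) with $\Delta_m = 2j$ and optimizing
with respect to $j$ shows that

$$\mathbb{E}_s\left[h^2(s,\hat{s})\right] \le C\max\left\{\left(|t|_{B^\alpha_{p,\infty}}\,n^{-\alpha}\right)^{2/(2\alpha+1)};\,n^{-1}\right\} \quad\text{if } \sqrt{s}\in B^\alpha_{p,\infty}([0,1]).$$

Similarly, we can show that

$$\mathbb{E}_s\left[h^2(s,\hat{s})\right] \le C\max\left\{\left(\mathrm{Var}^*(\sqrt{s})/n\right)^{2/3};\,n^{-1}\right\} \quad\text{if } \sqrt{s} \text{ has a bounded variation.}$$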

5.5 Model choice and Approximation Theory 

In any statistical framework for which we can prove a risk bound of the form (5.6)
provided that (5.3) holds, the technical problem of model selection can be considered
as being solved, but the question of how to choose the family of models to which we
shall apply the procedure remains. There is no general recipe to make such a choice
without any "a priori" information on $s$. If we have some information about the
true $s$ or at least we suspect that it may have some specific properties, or if we wish
that some particular $s$ should be accurately estimated, we should choose our family
of models in such a way that the right-hand side of (5.6) be as small as possible
for the $s$ of interest. Finding models of low dimension with good approximation
properties for some specific functions $s$ is one purpose of Approximation Theory.
We should therefore base our choice of suitable families of models on Approximation
Theory, which accounts for the numerous connections between modern Statistics and
Approximation Theory.

We may also have the choice between several families of models with different
approximation properties and complexity levels. Typically, the more complex families
have better approximation properties but we have to pay a price for the complexity.
A good example is the alternative between regular and irregular partitions for histograms.
As shown in the previous sections, it is possible to mix families with different
approximation and complexity properties by playing with the weights $\Delta_m$. In particular, it is
important that as many models as possible, and particularly those with good
approximation properties with respect to functions of greater interest, satisfy $\Delta_m \le c|m|$
for some fixed constant $c$. The introduction of the family of models $\{S_m, m\in\mathcal{M}_T\}$
in Section 5.4.4 illustrates this fact. These models, which have especially good
approximation properties with respect to a large class of Besov spaces, form a much
richer class than those solely based on regular partitions. Nevertheless, the number
of such models with dimension $D$ remains bounded by $\exp[c'D]$, which allows us to fix
$\Delta_m$ of the order of $D$ for these models. By (5.6), this implies that, when we use
such a family of models, the performance of the estimator based on model selection
is almost (up to constants) as good as the performance of the estimator based on the
best individual model.

A detailed analysis of the problems of model choice is given in Section 4.1 of Birgé
and Massart (2001), which also provides additional information about the relationship
between model selection and Approximation Theory. Further results in this direction
are to be found in Barron, Birgé and Massart (1999). It follows from these
presentations that all results in Approximation Theory that describe precisely the
approximation properties of some particular classes of finite dimensional models are
of special interest for the statistical applications we have in mind. Statistics has been
using various approximation methods and we would like to emphasize here two main
trends. One is based on approximation of functions by piecewise polynomials (or
similar functions like splines), some major references here being Birman and Solomjak
(1967) and the book by DeVore and Lorentz (1993). The statistical methods based
on this approach to approximation lead to estimators which are generalizations of
histograms, the selection procedure handling the choice of the partition (and also,
possibly, the degree of the polynomials). Another trend is based on the expansion
of functions on suitable bases, formerly the trigonometric basis, more recently bases
derived from a multiresolution analysis (wavelet bases and the like). The related
estimators are based on the estimation of the coefficients in the expansion and the
selection chooses the finite set of coefficients to be kept in the expansion of the final
estimator. Statistical procedures based on wavelet thresholding are of this type.
Theorem 6, based on DeVore and Yu (1990), provides a set of partitions which are
relevant for approximation of functions in Besov spaces. A parallel result by Birgé
and Massart (2000) applies to the second approach, providing a family of subsets of
coefficients to keep in order to get similar approximation properties. A good overview
of nonlinear approximation based on wavelets or piecewise polynomials with many
useful references is to be found in DeVore (1998).

The use of metric entropy or dimensional arguments in Statistics is not new. The
first general results connecting the metric dimension of the parameter set to the
performance of estimators are given by Le Cam (1973 and 1975) and statistical
applications of the classical entropy results by Kolmogorov and Tikhomirov (1961) are
developed in Birgé (1983). An up-to-date presentation with extensions to model
selection following ideas by Barron and Cover (1991) is in Birgé (2006). There is
also a huge amount of empirical process literature based on entropy arguments with
statistical applications. Many illustrations and references are to be found in van der
Vaart and Wellner (1996), van der Vaart (1998), van de Geer (2000) and Massart
(2006). More generally, connections between estimation and Approximation Theory,
in particular via wavelet thresholding, have been developed in many papers. Besides
the authors' works already cited, a short selection with further references is as follows:
DeVore, Kerkyacharian, Picard and Temlyakov (2004), Donoho and Johnstone
(1994, 1995, 1996 and 1998), Donoho, Johnstone, Kerkyacharian and Picard (1995,
1996 and 1997), Kerkyacharian and Picard (1992 and 2000) and Johnstone (1999).